Exploring the Pedestrian Traffic Data and Developmental Activity Footprint to Map and Predict Graffiti Vandalism and Waste Dumping Hotspots across the City of Melbourne¶

Authored by: Supratim Dobhal
Duration: 90 mins
Level: Intermediate
Pre-requisite Skills: Python, Data Wrangling, Exploratory Data Analysis, Geospatial Analysis, Time Series Analysis, Clustering, Predictive Modelling, Machine Learning

Scenario¶

Residents and local authorities in the City of Melbourne have observed a steady rise in graffiti vandalism and illegal waste dumping. These activities affect the surrounding community: they not only degrade the visual appeal of public spaces but also consume significant government resources for cleaning and repair.

The city wants to address this issue by proactively identifying, monitoring and maintaining areas that are susceptible to such activities. The goal is to combine historical incident data, pedestrian traffic data and the development footprint with socio-economic, temporal and geospatial information to build a system that maps existing graffiti vandalism and illegal waste dumping hotspots and predicts where they are likely to emerge next.

What this Use Case Will Teach You¶

At the end of this use case you will:

  • Learn how to load data using APIs
  • Understand how to interpret, clean, prepare and process data
  • Integrate data from multiple sources
  • Perform exploratory data analysis
  • Practise data-driven decision making
  • Visualize, map and analyse geospatial information
  • Build a prediction model
  • Recommend cost-effective, strategic solutions to optimize resource allocation

Introduction and Background¶

Graffiti vandalism and illegal waste dumping are two key pain points that affect the quality of life of residents and visitors to the city. They also force local authorities to allocate substantial resources. Conventionally, authorities have addressed this issue on an ad-hoc basis, dealing with complaints as and when they are reported. However, this approach is inefficient: it fails to account for the factors driving the problem, and it cannot predict future problem areas or recommend preventive solutions.

A better approach to this challenge is data-driven predictive analysis, where we can use historical data such as pedestrian traffic activity, geospatial information, temporal patterns, the development footprint and public complaints to forecast where these activities are most likely to happen. By analysing these factors, the authorities will be able to allocate their resources efficiently, wherever and whenever required, and take preventive measures to curb the problem.

For this use case, we will apply techniques such as geospatial analysis, time series analysis, correlation analysis, data wrangling and machine learning to identify, map and predict graffiti and waste dumping hotspots. We will also introduce other analytical, visualization and predictive tools along the way.

Datasets Used¶

  1. Pedestrian Counting System - Sensor Locations : https://data.melbourne.vic.gov.au/explore/dataset/pedestrian-counting-system-sensor-locations/information/
  2. Pedestrian Counting System - Counts Per Minute : https://melbournetestbed.opendatasoft.com/explore/dataset/pedestrian-counting-system-past-hour-counts-per-minute/table/?sort=sensing_datetime
  3. Customer Service Request with Resolution Time : https://data.melbourne.vic.gov.au/explore/dataset/customer-service-requests-with-resolution-time/table/?refine.category=Graffiti
  4. Developmental Activity Model Footprint : https://data.melbourne.vic.gov.au/explore/dataset/development-activity-model-footprints/table/

Implementation¶

  1. Importing necessary libraries
  2. Loading the Data using API
  3. Exploratory Data Analysis
  • Inspecting the Datasets
  • Deriving Summary Statistics
  • Data Cleaning and Feature Engineering
  • Visualizations
  4. Merging the Data
  5. GeoSpatial Analysis
  6. Predictive Modelling
  7. Analysing Model Performance
  8. Providing meaningful recommendations

Importing the necessary libraries and packages¶

The very first thing we need to do is import all the packages and libraries we will need for this use case. It is best to figure out exactly what you aim to achieve and which tools you will require to achieve it; this ensures that we do not run into issues later in the use case.

For this use case, we know that we will be dealing with pedestrian traffic data and customer service request data, so we will import packages like pandas for dataframe operations. We will also use libraries like matplotlib, seaborn and folium to visualize the data, map hotspots, interpret relationships within the data and present our findings.

In [111]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import os
os.environ['OMP_NUM_THREADS'] = '1'  # limit OpenMP threads (avoids a KMeans warning on some platforms)
from sklearn.preprocessing import StandardScaler
import geopandas as gpd
from shapely.geometry import Point
import requests
from io import StringIO
import folium
from folium.plugins import MarkerCluster

Defining a Function to download data through API¶

We will use the datasets available from the Melbourne Open Data Library, loading them via the API instead of manually downloading files. A dataset can be modified or updated in the future, so using the API helps keep this work reproducible.

The function below defines an API call that downloads datasets from the Melbourne Open Data platform in CSV format. It takes the dataset identifier as a parameter, builds a URL from this unique identifier, and uses the requests library to fetch the data. The response is then parsed into a pandas dataframe so that we can operate on it and analyse it.

In [112]:
def API_Unlimited(dataset_id):  # pass in the unique dataset identifier
    base_url = 'https://data.melbourne.vic.gov.au/api/explore/v2.1/catalog/datasets/'
    export_format = 'csv'

    url = f'{base_url}{dataset_id}/exports/{export_format}'
    params = {
        'select': '*',
        'limit': -1,  # all records
        'lang': 'en',
        'timezone': 'UTC'
    }

    # GET request
    response = requests.get(url, params=params)

    if response.status_code == 200:
        # StringIO to read the CSV data
        url_content = response.content.decode('utf-8')
        df = pd.read_csv(StringIO(url_content), delimiter=';')
        print(df.sample(10, random_state=999))  # quick sanity check
        return df
    else:
        print(f'Request failed with status code {response.status_code}')
        return None

Downloading the Required Datasets from City of Melbourne Open Data Website¶

Now that our packages and API function are ready, let us download the data we require for this use case. For this, we need the unique dataset identifiers that will be passed as parameters in the API call. The identifier can be found on the Melbourne Open Data platform: search for your dataset and you will see the identifier in the Overview or Details section.

We will be using four datasets in this Use Case as specified below:

  1. Pedestrian Counting System - Sensor Locations
  2. Pedestrian Counting System - Counts per minute
  3. Customer Service Requests with Resolution Time
  4. Developmental Activity Model Footprint
In [113]:
download_link_1 = 'pedestrian-counting-system-sensor-locations'
download_link_2 = 'pedestrian-counting-system-past-hour-counts-per-minute'
download_link_3 = 'customer-service-requests-with-resolution-time'
download_link_4 = 'development-activity-model-footprints'


# Use functions to download and load data
Pedestrian_SensorLoc = API_Unlimited(download_link_1)
Pedestrian_CountsPerMinute = API_Unlimited(download_link_2)
Reports_ServiceRequest = API_Unlimited(download_link_3)
Development_Activity_Footprint = API_Unlimited(download_link_4)
     location_id                                 sensor_description  \
61            65                          Swanston St - City Square   
93            17                              Collins Place (South)   
29            87                                    Errol St (West)   
126          137     COM Pole 2353 - Towards the city, NAB Building   
0              2                         Bourke Street Mall (South)   
24            72                                  Flinders St- ACMI   
18            43                       Monash Rd-Swanston St (West)   
125          131  I-Hub Corner of King Street and Flinders Stree...   
53            41                     Flinders La-Swanston St (West)   
140          166                                 484 Spencer Street   

      sensor_name installation_date  \
61        SwaCs_T        2020-03-12   
93        Col15_T        2009-03-30   
29      Errol23_T        2022-05-20   
126  BouHbr2353_T        2023-11-03   
0        Bou283_T        2009-03-30   
24         ACMI_T        2020-11-30   
18          UM2_T        2015-04-15   
125       King2_T        2023-09-25   
53          Swa31        2017-06-29   
140     Spen484_T        2024-09-25   

                                           note location_type status  \
61                                          NaN       Outdoor      A   
93             Device is upgraded in 26/02/2020       Outdoor      A   
29                                          NaN       Outdoor      A   
126                                         NaN       Outdoor      A   
0                                           NaN       Outdoor      A   
24                                          NaN       Outdoor      A   
18                                          NaN       Outdoor      A   
125                                         NaN       Outdoor      A   
53                                          NaN       Outdoor      A   
140  Former sensor 227 Bourke Street – City Lab       Outdoor      A   

    direction_1 direction_2   latitude   longitude                    location  
61        North       South -37.815694  144.966806   -37.81569416, 144.9668064  
93         East        West -37.813625  144.973236  -37.81362543, 144.97323591  
29        North       South -37.804549  144.949219  -37.80454949, 144.94921863  
126        East        West -37.818948  144.946123  -37.81894815, 144.94612292  
0          East        West -37.813807  144.965167  -37.81380668, 144.96516718  
24         East        West -37.817263  144.968728  -37.81726338, 144.96872809  
18        North       South -37.798445  144.964118  -37.79844526, 144.96411782  
125       North       South -37.820091  144.957587  -37.82009057, 144.95758725  
53        North       South -37.816686  144.966897  -37.81668634, 144.96689733  
140       North       South -37.808967  144.949317  -37.80896733, 144.94931703  
        location_id           sensing_datetime sensing_date sensing_time  \
111220          166  2024-12-13T02:55:00+00:00   2024-12-13        13:55   
100759           63  2024-12-12T11:03:00+00:00   2024-12-12        22:03   
55994            75  2024-12-11T23:42:00+00:00   2024-12-12        10:42   
25763            14  2024-12-11T21:31:00+00:00   2024-12-12        08:31   
116212           67  2024-12-13T06:50:00+00:00   2024-12-13        17:50   
82999            36  2024-12-13T00:21:00+00:00   2024-12-13        11:21   
28487            56  2024-12-11T22:34:00+00:00   2024-12-12        09:34   
52830            66  2024-12-11T22:05:00+00:00   2024-12-12        09:05   
14938            70  2024-12-11T14:33:00+00:00   2024-12-12        01:33   
81131            29  2024-12-12T22:54:00+00:00   2024-12-13        09:54   

        direction_1  direction_2  total_of_directions  
111220            5            5                   10  
100759            0            3                    3  
55994             0            1                    1  
25763             7            9                   16  
116212            8           18                   26  
82999             0            2                    2  
28487             3            1                    4  
52830             1            8                    9  
14938             0            1                    1  
81131             2            5                    7  
      request_status date_received date_completed          suburb  \
22471         CLOSED    2016-06-03     2016-06-09             NaN   
4040          CLOSED    2015-04-21     2015-04-23             NaN   
23604         CLOSED    2016-02-09     2016-02-17       Melbourne   
3143          CLOSED    2016-06-08     2016-06-10             NaN   
14416         CLOSED    2015-03-17     2015-03-23      Kensington   
39708         CLOSED    2015-07-10     2015-07-13             NaN   
31267         CLOSED    2016-05-10     2016-05-11             NaN   
33638         CLOSED    2014-11-28     2014-12-04  West Melbourne   
44158         CLOSED    2016-01-16     2016-02-26       Parkville   
46129         CLOSED    2015-05-11     2015-05-13      Kensington   

                                category               service_desc  \
22471                  Asset maintenance         Street Maintenance   
4040   Waste, Street Cleaning and Litter                Missing Bin   
23604  Waste, Street Cleaning and Litter   Street Cleaning services   
3143                   Roads and Traffic         Traffic Management   
14416  Waste, Street Cleaning and Litter                Missing Bin   
39708                           Graffiti           Graffiti Removal   
31267  Waste, Street Cleaning and Litter  Waste collection services   
33638  Waste, Street Cleaning and Litter             Dumped Rubbish   
44158                           Graffiti           Graffiti Removal   
46129                    Parks and Trees              Park Cleaning   

       days_to_complete  
22471                 6  
4040                  2  
23604                 8  
3143                  2  
14416                 6  
39708                 3  
31267                 1  
33638                 6  
44158                41  
46129                 2  
       dev_key              status      permit_num  bldhgt_ahd  base_ahd  \
592   X0004405  UNDER CONSTRUCTION     TPM-2018-14        90.9       3.0   
346   X0014301  UNDER CONSTRUCTION    TP-2019-80/B        44.3      28.5   
1270  X0015901             APPLIED  TPMR-2014-47/C        34.0       2.0   
902    X000088           COMPLETED     TPM-2008-61       200.0      27.9   
911    X001009           COMPLETED      TPD-2013-3        90.3       4.5   
350   X0014703            APPROVED      TPM-2023-8        79.4      27.6   
470    X000612           COMPLETED   TP-2011-479/A        65.9      17.7   
692   X0010625           COMPLETED  TP-2015-1057/B        40.1      33.5   
664   X0009403            APPROVED       ID-2020-5        27.0       3.0   
42     X000181           COMPLETED   TPM-2009-40/A        41.9      14.8   

                                                address  num_floors  \
592               971 Collins Street DOCKLANDS VIC 3008          27   
346                19-21 Bourke Street, Melbourne, 3000           3   
1270               65-73 Haig Street SOUTHBANK VIC 3006          40   
902             76-84 Collins Street MELBOURNE VIC 3000          44   
911             1000 La Trobe Street DOCKLANDS VIC 3008          23   
350            58-66 La Trobe Street MELBOURNE VIC 3000          42   
470   DUO Apartments 423-435 Spencer Street WEST MEL...          17   
692            558-566 Swanston Street CARLTON VIC 3053          13   
664      850-868 Lorimer Street PORT MELBOURNE VIC 3207          11   
42    Australis Apartments 593-611 Little Lonsdale S...          46   

                     land_use_1         land_use_2          land_use_3  \
592               Dwellings 313                NaN                 NaN   
346          Retail (sq.m) 1440                NaN                 NaN   
1270              Dwellings 312  Office (sq.m) 207     Bike Spaces 106   
902         Office (sq.m) 43019    Hotel rooms 252       Car parks 250   
911         Office (sq.m) 31572      Car parks 471  Retail (sq.m) 1613   
350          Office (sq.m) 6172       Dwellings 76     Hotel rooms 160   
470                Dwellings 89  Office (sq.m) 275                 NaN   
692   Student Accommodation 467  Retail (sq.m) 211     Bike Spaces 184   
664         Office (sq.m) 19895      Car parks 246   Retail (sq.m) 638   
42                Dwellings 520      Car parks 324   Retail (sq.m) 456   

     shape_type                   datadate  \
592       tower  2024-10-24T12:53:51+00:00   
346       tower  2024-10-24T12:53:51+00:00   
1270     podium  2024-10-24T12:53:51+00:00   
902   extension  2024-10-24T12:53:51+00:00   
911       tower  2024-10-24T12:53:51+00:00   
350       tower  2024-10-24T12:53:51+00:00   
470       tower  2024-10-24T12:53:51+00:00   
692      podium  2024-10-24T12:53:51+00:00   
664       tower  2024-10-24T12:53:51+00:00   
42       podium  2024-10-24T12:53:51+00:00   

                                          json_geometry  \
592   {"coordinates": [[[144.9380407594, -37.8202882...   
346   {"coordinates": [[[144.9722882905, -37.8117203...   
1270  {"coordinates": [[[144.9568610093, -37.8279409...   
902   {"coordinates": [[[144.9711827772, -37.8138131...   
911   {"coordinates": [[[144.9462976067, -37.8141084...   
350   {"coordinates": [[[144.9670769356, -37.8083218...   
470   {"coordinates": [[[144.949937924, -37.81046591...   
692   {"coordinates": [[[144.9641802596, -37.8034919...   
664   {"coordinates": [[[144.9345807445, -37.8232948...   
42    {"coordinates": [[[144.9540164811, -37.8135964...   

                                 geo_point_2d  
592    -37.820133148523695, 144.9378119685705  
346    -37.81179877937306, 144.97242323440605  
1270   -37.82788918902945, 144.95657772803258  
902   -37.813790219556665, 144.97088563197167  
911    -37.81431942032131, 144.94625557708025  
350      -37.808231395335, 144.96698869155546  
470    -37.81056553765372, 144.94987658526253  
692    -37.803572989384854, 144.9638905418592  
664    -37.82307153571362, 144.93485780441668  
42     -37.81380945720944, 144.95382959938118  

Exploratory Data Analysis¶

1. Inspecting the information in the Datasets for further Preparation¶

Once we have loaded the data, the next step is to inspect and explore it. This is a crucial step before cleaning or preparing the data: it is always useful to look at your data before making any transformations, because we need to understand it in order to identify the areas that require attention.

By exploring our datasets, we can find out whether we are dealing with null values, missing values, redundant data, and so on. We also learn the data type of each field, so that we can decide whether to drop or impute problematic values. Moreover, we learn what each feature represents and how it can be used to our advantage, allowing us to make better-informed decisions.
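As a minimal sketch of the checks described above, the snippet below counts nulls per column and fully duplicated rows. The tiny dataframe here is synthetic (it only mimics the shape of the sensor data); in practice we would run the same calls on each downloaded dataset.

```python
import pandas as pd

# Synthetic example frame (not real sensor data) to illustrate the checks
df = pd.DataFrame({
    "location_id": [1, 2, 2, 4],
    "note": [None, "Upgraded", "Upgraded", None],
})

null_counts = df.isna().sum()           # nulls per column
duplicate_rows = df.duplicated().sum()  # count of fully duplicated rows
print(null_counts)
print("duplicates:", duplicate_rows)
```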

Displaying the first few rows helps us understand the structure and content of each dataset.

Pedestrian Counting System - Sensor Locations¶

This dataset contains the geospatial information for all sensors active across the City of Melbourne. It provides each sensor's name and description (street name), along with its latitude and longitude. Each sensor is identified by a location ID and comes with further information such as the installation date, status (whether the sensor is active) and the directions the sensor faces.

In [114]:
# Display the first few rows of each dataset
print("Dataset 1 - Pedestrian Counting System Sensor Locations:\n\n")
Pedestrian_SensorLoc.head()
Dataset 1 - Pedestrian Counting System Sensor Locations:


Out[114]:
location_id sensor_description sensor_name installation_date note location_type status direction_1 direction_2 latitude longitude location
0 2 Bourke Street Mall (South) Bou283_T 2009-03-30 NaN Outdoor A East West -37.813807 144.965167 -37.81380668, 144.96516718
1 4 Town Hall (West) Swa123_T 2009-03-23 NaN Outdoor A North South -37.814880 144.966088 -37.81487988, 144.9660878
2 6 Flinders Street Station Underpass FliS_T 2009-03-25 Upgraded on 8/09/21 Outdoor A North South -37.819117 144.965583 -37.81911705, 144.96558255
3 8 Webb Bridge WebBN_T 2009-03-24 NaN Outdoor A North South -37.822935 144.947175 -37.82293543, 144.9471751
4 10 Victoria Point BouHbr_T 2009-04-23 NaN Outdoor A East West -37.818765 144.947105 -37.81876474, 144.94710545
Pedestrian Counting System - Counts Per Minute¶

This dataset contains pedestrian traffic information: movement counts captured by the active sensors across the City of Melbourne, recorded for each direction a sensor faces. Each row is identified by the sensor's location ID and a one-minute timestamp, so the dataset holds per-minute movement counts for the past two days.
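Per-minute counts are noisy, so later analysis typically aggregates them. As a hedged sketch (synthetic rows, but the real column names from this dataset), here is one way to roll the counts up to hourly totals per sensor:

```python
import pandas as pd

# Synthetic sample mirroring the counts-per-minute schema
counts = pd.DataFrame({
    "location_id": [4, 4, 4, 3],
    "sensing_datetime": pd.to_datetime([
        "2024-12-11T13:57:00+00:00", "2024-12-11T13:58:00+00:00",
        "2024-12-11T14:06:00+00:00", "2024-12-11T14:08:00+00:00"]),
    "total_of_directions": [5, 1, 1, 6],
})

# Hourly pedestrian totals for each sensor
hourly = (counts
          .set_index("sensing_datetime")
          .groupby("location_id")["total_of_directions"]
          .resample("h")
          .sum())
print(hourly)
```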

In [115]:
print("\nDataset 2 - Pedestrian Counting System Counts Per Minute:\n\n")
Pedestrian_CountsPerMinute.head()
Dataset 2 - Pedestrian Counting System Counts Per Minute:


Out[115]:
location_id sensing_datetime sensing_date sensing_time direction_1 direction_2 total_of_directions
0 3 2024-12-11T14:08:00+00:00 2024-12-12 01:08 6 0 6
1 4 2024-12-11T13:57:00+00:00 2024-12-12 00:57 0 5 5
2 4 2024-12-11T13:58:00+00:00 2024-12-12 00:58 1 0 1
3 4 2024-12-11T14:06:00+00:00 2024-12-12 01:06 0 1 1
4 4 2024-12-11T14:07:00+00:00 2024-12-12 01:07 0 4 4
Customer Service Requests for Removal of Graffiti and Illegal Waste Dumping¶

This dataset contains all the service requests made by residents to the City of Melbourne authorities between 2014 and 2016. These requests relate to matters such as graffiti removal, street cleaning services and removal of illegally dumped waste. It also provides related information such as the date the request was received, the date of completion, the suburb the request came from, the number of days taken to complete the request and a description of the request.

We will use this data to identify areas with a high number of historical graffiti vandalism or illegal waste dumping cases. This will be a crucial factor in predicting whether these areas are susceptible to such incidents in the future.
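A minimal sketch of that idea, assuming synthetic rows but the real `suburb` and `service_desc` columns: filter to the incident types we care about and count requests per suburb to surface hotspot candidates.

```python
import pandas as pd

# Synthetic sample mirroring the service-request schema
requests = pd.DataFrame({
    "suburb": ["Carlton", "Carlton", "Kensington", "Melbourne"],
    "service_desc": ["Graffiti Removal", "Graffiti Removal",
                     "Dumped Rubbish", "Street Cleaning services"],
})

# Keep only graffiti / dumping incidents, then count per suburb
hotspot_like = requests[requests["service_desc"]
                        .isin(["Graffiti Removal", "Dumped Rubbish"])]
per_suburb = hotspot_like.groupby("suburb").size().sort_values(ascending=False)
print(per_suburb)
```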

In [116]:
print("\nDataset 3 - Reportings for Removal of Graffiti and Illegal Waste Dumping:\n\n")
Reports_ServiceRequest.head()
Dataset 3 - Reportings for Removal of Graffiti and Illegal Waste Dumping:


Out[116]:
request_status date_received date_completed suburb category service_desc days_to_complete
0 CLOSED 2016-05-05 2016-05-12 Carlton Graffiti Graffiti Removal 7
1 CLOSED 2014-11-20 2014-11-21 East Melbourne Waste, Street Cleaning and Litter Missed Bin Collection 1
2 CLOSED 2014-11-27 2014-12-01 Carlton Waste, Street Cleaning and Litter Missed Bin Collection 4
3 CLOSED 2015-01-21 2015-01-22 Melbourne Waste, Street Cleaning and Litter Street Cleaning services 1
4 CLOSED 2015-01-23 2015-01-27 Southbank Waste, Street Cleaning and Litter Damaged Bins 4
Developmental Activity Model Footprint¶

This dataset describes development activity taking place across the city, along with its address and geospatial information. It includes the type of development, the status of the activity, the permit number, the address, a description, the land use and the geospatial footprint. This is crucial for us because it can help us understand the socio-economic character of different areas of Melbourne, and we can use it to check whether there is any relationship between graffiti vandalism and development activity.

In [117]:
print("\nDataset 4 - Developmental Activity Model Footprint:")
Development_Activity_Footprint.head()
Dataset 4 - Developmental Activity Model Footprint:
Out[117]:
dev_key status permit_num bldhgt_ahd base_ahd address num_floors land_use_1 land_use_2 land_use_3 shape_type datadate json_geometry geo_point_2d
0 X000931 COMPLETED TPM-2013-4 321.0 2.3 68-82 Southbank Boulevard SOUTHBANK VIC 3006 99 Dwellings 1060 Car parks 600 Retail (sq.m) 659 tower 2024-10-24T12:53:51+00:00 {"coordinates": [[[144.9637806979, -37.8230519... -37.82315946744557, 144.96395988531373
1 X001215 COMPLETED TPM-2010-31 31.6 1.9 312-320 City Road SOUTHBANK VIC 3006 43 Dwellings 494 Car parks 180 Retail (sq.m) 276 podium 2024-10-24T12:53:51+00:00 {"coordinates": [[[144.9584968027, -37.8271049... -37.827113426654456, 144.95822445674378
2 X000930 COMPLETED TPM-2014-42 30.9 1.3 54-68 Kavagh Street SOUTHBANK VIC 3006 51 Dwellings 597 Car parks 417 Retail (sq.m) 557 podium 2024-10-24T12:53:51+00:00 {"coordinates": [[[144.9637079271, -37.8244040... -37.82409175282328, 144.9637290411646
3 X000930 COMPLETED TPM-2014-42 157.8 1.3 54-68 Kavagh Street SOUTHBANK VIC 3006 51 Dwellings 597 Car parks 417 Retail (sq.m) 557 tower 2024-10-24T12:53:51+00:00 {"coordinates": [[[144.9637079271, -37.8244040... -37.82416188970118, 144.9637093351437
4 X000755 COMPLETED TPM-2014-33 21.6 3.0 68-70 Dorcas Street SOUTHBANK VIC 3006 28 Dwellings 238 Car parks 193 Retail (sq.m) 240 podium 2024-10-24T12:53:51+00:00 {"coordinates": [[[144.9675327506, -37.8303791... -37.83063263010421, 144.9674665959633

2. Dataset Summary and Statistics¶

Now that we understand the structure and content of our datasets a little better, we will delve deeper by describing each dataset and deriving summary statistics. In the process we will get the exact data type of each feature, the column names and counts of missing values.

This step is also significant for gaining insight into our datasets, helping us understand the distribution of the data. We get valuable information such as the count, mean, standard deviation, min-max range and interquartile range, which can be used to identify anomalies and outliers, impute missing or null values, perform range checks, and so on.
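To make the outlier idea concrete, here is a small sketch of the standard 1.5 × IQR rule applied to a handful of made-up completion times (the values are illustrative, not drawn from the real dataset):

```python
import pandas as pd

# Illustrative completion times in days (synthetic values)
days = pd.Series([1, 2, 5, 13, 41, 724])

# Quartiles and interquartile range, as reported by .describe()
q1, q3 = days.quantile(0.25), days.quantile(0.75)
iqr = q3 - q1
upper = q3 + 1.5 * iqr  # conventional upper fence

outliers = days[days > upper]
print("upper fence:", upper)
print("outliers:", outliers.tolist())
```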

Descriptive Statistics for Pedestrian Counting System - Sensor Locations Dataset¶
In [118]:
# Display basic information about the sensor locations dataset
print("Pedestrian Counting System Sensor Locations Dataset Information:\n\n")
Pedestrian_SensorLoc.info()
print("\nPedestrian Counting System Sensor Locations Dataset Summary Statistics:\n\n")
Pedestrian_SensorLoc.describe()
Pedestrian Counting System Sensor Locations Dataset Information:


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 141 entries, 0 to 140
Data columns (total 12 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   location_id         141 non-null    int64  
 1   sensor_description  139 non-null    object 
 2   sensor_name         141 non-null    object 
 3   installation_date   139 non-null    object 
 4   note                33 non-null     object 
 5   location_type       141 non-null    object 
 6   status              141 non-null    object 
 7   direction_1         109 non-null    object 
 8   direction_2         109 non-null    object 
 9   latitude            141 non-null    float64
 10  longitude           141 non-null    float64
 11  location            141 non-null    object 
dtypes: float64(2), int64(1), object(9)
memory usage: 13.3+ KB

Pedestrian Counting System Sensor Locations Dataset Summary Statistics:


Out[118]:
location_id latitude longitude
count 141.000000 141.000000 141.000000
mean 83.716312 -37.812446 144.960506
std 51.682178 0.006850 0.009855
min 1.000000 -37.825910 144.928606
25% 40.000000 -37.817263 144.955570
50% 80.000000 -37.813449 144.961567
75% 135.000000 -37.807784 144.966233
max 166.000000 -37.789353 144.986388
Descriptive Statistics for Pedestrian Counting System - Counts Per Minute Dataset¶
In [119]:
# Display basic information about the counts-per-minute dataset
print("Pedestrian Counts Per Minute Dataset Information:\n\n")
Pedestrian_CountsPerMinute.info()
print("\nPedestrian Counts Per Minute Dataset Summary Statistics:\n\n")
Pedestrian_CountsPerMinute.describe()
Pedestrian Counts Per Minute Dataset Information:


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 116351 entries, 0 to 116350
Data columns (total 7 columns):
 #   Column               Non-Null Count   Dtype 
---  ------               --------------   ----- 
 0   location_id          116351 non-null  int64 
 1   sensing_datetime     116351 non-null  object
 2   sensing_date         116351 non-null  object
 3   sensing_time         116351 non-null  object
 4   direction_1          116351 non-null  int64 
 5   direction_2          116351 non-null  int64 
 6   total_of_directions  116351 non-null  int64 
dtypes: int64(4), object(3)
memory usage: 6.2+ MB

Pedestrian Counts Per Minute Dataset Summary Statistics:


Out[119]:
location_id direction_1 direction_2 total_of_directions
count 116351.000000 116351.000000 116351.000000 116351.000000
mean 50.417341 6.983859 7.009446 13.993305
std 36.764527 10.844983 11.220387 20.520189
min 1.000000 0.000000 0.000000 1.000000
25% 23.000000 1.000000 1.000000 3.000000
50% 47.000000 3.000000 3.000000 7.000000
75% 68.000000 8.000000 8.000000 17.000000
max 166.000000 168.000000 207.000000 335.000000
Descriptive Statistics for Service Request - Removal of Graffiti and Waste Dumping Dataset¶
In [120]:
# Display basic information about the service request dataset
print("Service Request - Removal of Graffiti and Waste Dumping Dataset Info:\n\n")
Reports_ServiceRequest.info()
print("\nService Request - Removal of Graffiti and Waste Dumping Dataset Summary Statistics:\n\n")
Reports_ServiceRequest.describe()
Service Request - Removal of Graffiti and Waste Dumping Dataset Info:


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 46479 entries, 0 to 46478
Data columns (total 7 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   request_status    46479 non-null  object
 1   date_received     46479 non-null  object
 2   date_completed    45631 non-null  object
 3   suburb            21585 non-null  object
 4   category          46479 non-null  object
 5   service_desc      46479 non-null  object
 6   days_to_complete  46479 non-null  int64 
dtypes: int64(1), object(6)
memory usage: 2.5+ MB

Service Request - Removal of Graffiti and Waste Dumping Dataset Summary Statistics:


Out[120]:
days_to_complete
count 46479.000000
mean 15.069279
std 39.651448
min 1.000000
25% 2.000000
50% 5.000000
75% 13.000000
max 724.000000
Descriptive Statistics for Development Activity Model Footprint Dataset¶
In [121]:
# Display basic information about the development activity dataset
print("Development Activity Model Footprint Dataset Info:\n\n")
Development_Activity_Footprint.info()
print("\nDevelopment Activity Model Footprint Dataset Summary Statistics:\n\n")
Development_Activity_Footprint.describe()
Development Activity Model Footprint Dataset Info:


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1274 entries, 0 to 1273
Data columns (total 14 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   dev_key        1274 non-null   object 
 1   status         1274 non-null   object 
 2   permit_num     1274 non-null   object 
 3   bldhgt_ahd     1274 non-null   float64
 4   base_ahd       1274 non-null   float64
 5   address        1274 non-null   object 
 6   num_floors     1274 non-null   int64  
 7   land_use_1     1260 non-null   object 
 8   land_use_2     1160 non-null   object 
 9   land_use_3     988 non-null    object 
 10  shape_type     1274 non-null   object 
 11  datadate       1274 non-null   object 
 12  json_geometry  1274 non-null   object 
 13  geo_point_2d   1274 non-null   object 
dtypes: float64(2), int64(1), object(11)
memory usage: 139.5+ KB

Development Activity Model Footprint Dataset Summary Statistics:


Out[121]:
bldhgt_ahd base_ahd num_floors
count 1274.000000 1274.000000 1274.000000
mean 64.646860 13.999137 22.770016
std 51.945998 11.349001 18.492809
min 4.200000 0.000000 0.000000
25% 31.000000 3.000000 8.000000
50% 48.450000 12.500000 17.000000
75% 79.100000 23.175000 33.000000
max 368.100000 48.400000 102.000000

3. Data Cleaning and Feature Engineering¶

After loading, exploring and deriving the summary statistics of the datasets, it is time to clean our data. This step is extremely important, as our data could be riddled with null values, missing values, duplicate values, and other inconsistencies that could severely affect our analysis, visualizations and predictions. We need to ensure that our data is clean and consistent.

We will also be performing feature engineering, where we create new columns, or features, derived or calculated from the existing features in a dataset. In some cases we are presented with information that we can't use effectively as it is, and we need to transform it to make it useful. That is exactly what we plan to do here: we will create new features to be used for further analysis and predictive modelling.

Pedestrian Sensor Location Dataset¶

For the pedestrian sensor location dataset, we will process the latitude and longitude columns, converting them from string to a numeric data type. This is important because we will use the separate latitude and longitude values to analyse and visualize our data, and also to join our datasets based on proximity parameters.

Columns that are irrelevant to our analysis will then be dropped from this dataset, so that only the important information remains.

In [122]:
# Convert latitude and longitude to numeric
Pedestrian_SensorLoc['Latitude'] = pd.to_numeric(Pedestrian_SensorLoc['latitude'], errors='coerce')
Pedestrian_SensorLoc['Longitude'] = pd.to_numeric(Pedestrian_SensorLoc['longitude'], errors='coerce')

# Drop irrelevant columns
Pedestrian_SensorLoc = Pedestrian_SensorLoc.drop(columns=['location', 'note', 'latitude', 'longitude', 'installation_date','status'])

Let's see if the changes are reflected in our dataset

In [123]:
Pedestrian_SensorLoc
Out[123]:
location_id sensor_description sensor_name location_type direction_1 direction_2 Latitude Longitude
0 2 Bourke Street Mall (South) Bou283_T Outdoor East West -37.813807 144.965167
1 4 Town Hall (West) Swa123_T Outdoor North South -37.814880 144.966088
2 6 Flinders Street Station Underpass FliS_T Outdoor North South -37.819117 144.965583
3 8 Webb Bridge WebBN_T Outdoor North South -37.822935 144.947175
4 10 Victoria Point BouHbr_T Outdoor East West -37.818765 144.947105
... ... ... ... ... ... ... ... ...
136 161 Birrarung Marr - COM - Pole 1109 BirArt1109_T Outdoor East West -37.818513 144.971313
137 164 I-Hub 526 La Trobe Street Lat526_T Outdoor East North -37.813005 144.951604
138 164 I-Hub 526 La Trobe Street Lat526_T Outdoor East South -37.813005 144.951604
139 164 I-Hub 526 La Trobe Street Lat526_T Outdoor North East -37.813005 144.951604
140 166 484 Spencer Street Spen484_T Outdoor North South -37.808967 144.949317

141 rows × 8 columns

Checking for Duplicates¶

We flag all the duplicate entries in the location_id column, then display them so that we can deal with them before they affect our analysis.

Here we can see that there are a lot of duplicate entries, but if we look more closely, the records sharing the same location_id correspond to different directions of the same sensor. Therefore, we must not remove these entries, as they will be crucial when we calculate the total movement, or daily traffic, at each location.

In [124]:
# Check for duplicate location_id entries
duplicate_locations = Pedestrian_SensorLoc[Pedestrian_SensorLoc.duplicated(subset=['location_id'], keep=False)]

if not duplicate_locations.empty:
    print("Duplicate location entries found:")
    print(duplicate_locations)
else:
    print("No duplicate location entries found.")
Duplicate location entries found:
     location_id         sensor_description sensor_name location_type  \
7             18      Collins Place (North)     Col12_T       Outdoor   
8             18      Collins Place (North)     Col12_T       Outdoor   
16            39               Alfred Place     AlfPl_T       Outdoor   
17            39               Alfred Place     AlfPl_T       Outdoor   
48           164  I-Hub 526 La Trobe Street    Lat526_T       Outdoor   
49           164  I-Hub 526 La Trobe Street    Lat526_T       Outdoor   
84           164  I-Hub 526 La Trobe Street    Lat526_T       Outdoor   
87             3          Melbourne Central    Swa295_T       Outdoor   
88             3          Melbourne Central    Swa295_T       Outdoor   
89             3          Melbourne Central    Swa295_T       Outdoor   
94            18      Collins Place (North)     Col12_T       Outdoor   
99            39               Alfred Place     AlfPl_T       Outdoor   
137          164  I-Hub 526 La Trobe Street    Lat526_T       Outdoor   
138          164  I-Hub 526 La Trobe Street    Lat526_T       Outdoor   
139          164  I-Hub 526 La Trobe Street    Lat526_T       Outdoor   

    direction_1 direction_2   Latitude   Longitude  
7          East       InOut -37.813449  144.973054  
8            In         Out -37.813449  144.973054  
16        North       InOut -37.813797  144.969957  
17        North       South -37.813797  144.969957  
48         East        West -37.813005  144.951604  
49        North       South -37.813005  144.951604  
84        North        West -37.813005  144.951604  
87           In         Out -37.811015  144.964295  
88        North       InOut -37.811015  144.964295  
89        North       South -37.811015  144.964295  
94         East        West -37.813449  144.973054  
99           In         Out -37.813797  144.969957  
137        East       North -37.813005  144.951604  
138        East       South -37.813005  144.951604  
139       North        East -37.813005  144.951604  
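The claim that these rows are direction variants rather than true duplicates can be checked directly: rows repeat on location_id alone but are unique once the direction columns are included. A minimal sketch with illustrative values (not the real sensor table):

```python
import pandas as pd

# Toy frame mimicking the sensor table: one location_id appears twice,
# but with a different direction pair each time (illustrative values)
df = pd.DataFrame({
    'location_id': [18, 18, 39],
    'direction_1': ['East', 'In', 'North'],
    'direction_2': ['InOut', 'Out', 'South'],
})

# Duplicated on location_id alone...
dup_on_id = df.duplicated(subset=['location_id']).sum()

# ...but unique once the direction columns are included
dup_full = df.duplicated(subset=['location_id', 'direction_1', 'direction_2']).sum()

print(dup_on_id, dup_full)  # 1 duplicate by id, 0 full-row duplicates
```

This is why we keep the repeated location_ids: dropping them would discard one direction of each sensor's traffic.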
Pedestrian Counts Per Minute Dataset¶

For this dataset, we will first convert sensing_datetime to the datetime data type and correct the naming convention of the column. After storing the converted values in a new column, 'Sensing_Datetime', we can remove the old column, which won't be useful going forward. We will again look for duplicate entries and remove them from our dataset. Unlike the Pedestrian Sensor Location dataset, where similar entries exist for different directions, this Counts Per Minute dataset should not have any duplicate entries for the same timestamp, as the directions here are separate columns of the dataframe.

We will then drop all records with NA values in the location_id and total_of_directions columns using the dropna() function, so that records providing no valuable information do not affect our analysis in any negative way.

In [125]:
# Convert sensing_datetime to datetime
Pedestrian_CountsPerMinute['Sensing_Datetime'] = pd.to_datetime(Pedestrian_CountsPerMinute['sensing_datetime'])

# Drop duplicates
Pedestrian_CountsPerMinute = Pedestrian_CountsPerMinute.drop_duplicates()

# Drop rows missing location_id or total_of_directions
Pedestrian_CountsPerMinute = Pedestrian_CountsPerMinute.dropna(subset=['location_id', 'total_of_directions'])
# Drop irrelevant columns
Pedestrian_CountsPerMinute = Pedestrian_CountsPerMinute.drop(columns=['sensing_datetime'])

Let's see if the changes have been made correctly

In [126]:
Pedestrian_CountsPerMinute
Out[126]:
location_id sensing_date sensing_time direction_1 direction_2 total_of_directions Sensing_Datetime
0 3 2024-12-12 01:08 6 0 6 2024-12-11 14:08:00+00:00
1 4 2024-12-12 00:57 0 5 5 2024-12-11 13:57:00+00:00
2 4 2024-12-12 00:58 1 0 1 2024-12-11 13:58:00+00:00
3 4 2024-12-12 01:06 0 1 1 2024-12-11 14:06:00+00:00
4 4 2024-12-12 01:07 0 4 4 2024-12-11 14:07:00+00:00
... ... ... ... ... ... ... ...
116346 161 2024-12-13 18:10 16 8 24 2024-12-13 07:10:00+00:00
116347 164 2024-12-13 17:20 9 5 14 2024-12-13 06:20:00+00:00
116348 164 2024-12-13 17:45 14 3 17 2024-12-13 06:45:00+00:00
116349 164 2024-12-13 17:50 8 5 13 2024-12-13 06:50:00+00:00
116350 164 2024-12-13 18:00 10 14 24 2024-12-13 07:00:00+00:00

116351 rows × 7 columns

Customer Service Request Dataset¶

As we know, this dataset contains information for every service request made to the public authorities from 2014 to 2016, so we should expect all kinds of service requests in it. To fetch only the information that concerns us, we will need to filter this dataset by the type of service request.

For this, let us first check the types of service requests we are dealing with, so that we can extract the data for the removal of graffiti and illegal waste dumps.

In [127]:
Reports_ServiceRequest['service_desc'].unique()
Out[127]:
array(['Graffiti Removal', 'Missed Bin Collection',
       'Street Cleaning services', 'Damaged Bins', 'Public Litter Bin',
       'Park Cleaning', 'Tree Maintenance Services',
       'Condition of Assets in Parks', 'Road and Footpath Maintenance',
       'Missing Bin', 'Lawns and Irrigation', 'Waste collection services',
       'Dumped Rubbish', 'Syringe pick-up services', 'Street Maintenance',
       'Parking Compliance Services', 'Street Lighting Maintenance',
       'Public Toilets', 'Traffic Management',
       'Sport and Playground Facilities', 'Organic Waste',
       'Parking Meter Service', 'Waste Compactor', 'Drain Maintenance',
       'Bridge Maintenance', 'Bike pod services', 'Waterways'],
      dtype=object)

Now that we know what types of service requests exist in the dataset, we can filter out the data we require for our analysis, that is, 'Graffiti Removal' and 'Dumped Rubbish'. We will use string operations to filter the data containing these two phrases and update our dataset accordingly.

We will then make sure that all our date columns are converted to the datetime data type using the to_datetime function. To avoid any false values, we use the converted Date_Completed and Date_Received columns to recalculate the Days_to_Complete column (the number of days taken to complete a service request). This ensures that these values are absolutely correct and in the right format.

Finally, we will ensure that there are no records with NA values in the suburb column, since we are going to use this column indirectly to join our datasets.

In [128]:
# Filter for "Graffiti" or "Waste"-related requests
Reports_ServiceRequest = Reports_ServiceRequest[
    Reports_ServiceRequest['service_desc'].str.contains("Graffiti|Dumped Rubbish", na=False, case=False)
]

# Convert dates to datetime
Reports_ServiceRequest['Date_Received'] = pd.to_datetime(Reports_ServiceRequest['date_received'])
Reports_ServiceRequest['Date_Completed'] = pd.to_datetime(Reports_ServiceRequest['date_completed'])

# Calculate time to resolve
Reports_ServiceRequest['Days_to_Complete'] = (Reports_ServiceRequest['Date_Completed'] - Reports_ServiceRequest['Date_Received']).dt.days

# Remove any NA values from the suburb column
Reports_ServiceRequest = Reports_ServiceRequest.dropna(subset=['suburb'])

Let's have a look at our dataset now

In [129]:
Reports_ServiceRequest
Out[129]:
request_status date_received date_completed suburb category service_desc days_to_complete Date_Received Date_Completed Days_to_Complete
0 CLOSED 2016-05-05 2016-05-12 Carlton Graffiti Graffiti Removal 7 2016-05-05 2016-05-12 7.0
5 CLOSED 2014-12-01 2014-12-08 Carlton Graffiti Graffiti Removal 7 2014-12-01 2014-12-08 7.0
9 CLOSED 2014-12-06 2014-12-15 Parkville Graffiti Graffiti Removal 9 2014-12-06 2014-12-15 9.0
10 CLOSED 2014-12-11 2015-01-22 Melbourne Graffiti Graffiti Removal 42 2014-12-11 2015-01-22 42.0
12 CLOSED 2014-12-17 2015-01-08 Melbourne Graffiti Graffiti Removal 22 2014-12-17 2015-01-08 22.0
... ... ... ... ... ... ... ... ... ... ...
46411 CLOSED 2015-08-10 2015-08-17 Melbourne Graffiti Graffiti Removal 7 2015-08-10 2015-08-17 7.0
46414 CLOSED 2015-12-22 2015-12-23 North Melbourne Waste, Street Cleaning and Litter Dumped Rubbish 1 2015-12-22 2015-12-23 1.0
46418 CLOSED 2016-03-15 2016-03-21 Melbourne Graffiti Graffiti Removal 6 2016-03-15 2016-03-21 6.0
46419 CLOSED 2014-11-26 2014-12-01 Melbourne Graffiti Graffiti Removal 5 2014-11-26 2014-12-01 5.0
46441 CLOSED 2016-01-04 2016-02-19 Carlton Graffiti Graffiti Removal 46 2016-01-04 2016-02-19 46.0

6191 rows × 10 columns

Developmental Activity Model Footprint Dataset¶

This dataset has all the key information we require to analyse how much development is happening at each location. The data provided is clean and consistent, except for the two columns that hold the geospatial information.

Geospatial information is essential for connecting these datasets, as it is the common factor amongst all of them. However, to work with locations and join the datasets, we need the Latitude and Longitude values separately, as we have in our other three datasets. Therefore, we are going to split the coordinates into separate Latitude and Longitude columns using the string split function, and remove the old columns as we won't need them again.

In [130]:
# Extract coordinates
Development_Activity_Footprint[['Latitude', 'Longitude']] = Development_Activity_Footprint['geo_point_2d'].str.split(',', expand=True).astype(float)

# Drop irrelevant columns
Development_Activity_Footprint = Development_Activity_Footprint.drop(columns=['json_geometry', 'geo_point_2d'])

Always best to have a quick look before we proceed

In [132]:
Development_Activity_Footprint.head()
Out[132]:
dev_key status permit_num bldhgt_ahd base_ahd address num_floors land_use_1 land_use_2 land_use_3 shape_type datadate Latitude Longitude
0 X000931 COMPLETED TPM-2013-4 321.0 2.3 68-82 Southbank Boulevard SOUTHBANK VIC 3006 99 Dwellings 1060 Car parks 600 Retail (sq.m) 659 tower 2024-10-24T12:53:51+00:00 -37.823159 144.963960
1 X001215 COMPLETED TPM-2010-31 31.6 1.9 312-320 City Road SOUTHBANK VIC 3006 43 Dwellings 494 Car parks 180 Retail (sq.m) 276 podium 2024-10-24T12:53:51+00:00 -37.827113 144.958224
2 X000930 COMPLETED TPM-2014-42 30.9 1.3 54-68 Kavagh Street SOUTHBANK VIC 3006 51 Dwellings 597 Car parks 417 Retail (sq.m) 557 podium 2024-10-24T12:53:51+00:00 -37.824092 144.963729
3 X000930 COMPLETED TPM-2014-42 157.8 1.3 54-68 Kavagh Street SOUTHBANK VIC 3006 51 Dwellings 597 Car parks 417 Retail (sq.m) 557 tower 2024-10-24T12:53:51+00:00 -37.824162 144.963709
4 X000755 COMPLETED TPM-2014-33 21.6 3.0 68-70 Dorcas Street SOUTHBANK VIC 3006 28 Dwellings 238 Car parks 193 Retail (sq.m) 240 podium 2024-10-24T12:53:51+00:00 -37.830633 144.967467
Checking for Impact on the Distributions after Cleaning the Datasets¶

After cleaning the data, removing duplicates and dropping NA values, the distributions of our datasets will certainly have been impacted. It is important that we look at the distributions of all the features before we go on to perform further analysis, make any transformations or create any visualizations.

So, let us use the same method to describe our datasets and look at key summary statistics that might give us good insight into our data.

In [133]:
# Summary statistics
print("\nGraffiti Requests Summary:\n\n", Reports_ServiceRequest.describe())
print("\n\nPedestrian Sensor Location:\n\n", Pedestrian_SensorLoc.describe())
print("\n\nPedestrian Counts Summary:\n\n", Pedestrian_CountsPerMinute.describe())
print("\nDevelopment Activity Summary:\n\n", Development_Activity_Footprint.describe())
Graffiti Requests Summary:

        days_to_complete                  Date_Received  \
count       6191.000000                           6191   
mean          13.555322  2015-12-31 16:32:01.240510464   
min            1.000000            2014-11-10 00:00:00   
25%            6.000000            2015-07-29 00:00:00   
50%            9.000000            2016-01-29 00:00:00   
75%           17.000000            2016-06-14 12:00:00   
max          505.000000            2016-11-08 00:00:00   
std           16.036231                            NaN   

                      Date_Completed  Days_to_Complete  
count                           6115       6115.000000  
mean   2016-01-10 11:51:52.542927104         13.422077  
min              2014-11-11 00:00:00          0.000000  
25%              2015-08-17 00:00:00          6.000000  
50%              2016-02-09 00:00:00          9.000000  
75%              2016-06-22 00:00:00         17.000000  
max              2016-11-08 00:00:00        505.000000  
std                              NaN         15.771168  


Pedestrian Sensor Location:

        location_id    Latitude   Longitude
count   141.000000  141.000000  141.000000
mean     83.716312  -37.812446  144.960506
std      51.682178    0.006850    0.009855
min       1.000000  -37.825910  144.928606
25%      40.000000  -37.817263  144.955570
50%      80.000000  -37.813449  144.961567
75%     135.000000  -37.807784  144.966233
max     166.000000  -37.789353  144.986388


Pedestrian Counts Summary:

          location_id    direction_1    direction_2  total_of_directions
count  116351.000000  116351.000000  116351.000000        116351.000000
mean       50.417341       6.983859       7.009446            13.993305
std        36.764527      10.844983      11.220387            20.520189
min         1.000000       0.000000       0.000000             1.000000
25%        23.000000       1.000000       1.000000             3.000000
50%        47.000000       3.000000       3.000000             7.000000
75%        68.000000       8.000000       8.000000            17.000000
max       166.000000     168.000000     207.000000           335.000000

Development Activity Summary:

         bldhgt_ahd     base_ahd   num_floors     Latitude    Longitude
count  1274.000000  1274.000000  1274.000000  1274.000000  1274.000000
mean     64.646860    13.999137    22.770016   -37.811459   144.953774
std      51.945998    11.349001    18.492809     0.011618     0.013716
min       4.200000     0.000000     0.000000   -37.849234   144.908009
25%      31.000000     3.000000     8.000000   -37.820170   144.944971
50%      48.450000    12.500000    17.000000   -37.811391   144.955940
75%      79.100000    23.175000    33.000000   -37.804006   144.961944
max     368.100000    48.400000   102.000000   -37.778778   144.989882

4. Visualizing our Data¶

Distribution of Days Taken to Complete a Graffiti Removal or Waste Removal Requests¶

To analyse how long it generally takes to complete the service requests for graffiti and waste removal, we will use a histplot with a kernel density estimate overlay to plot the distribution of the Days_to_Complete column that we calculated above from the Reports Service Requests dataset. This will help us identify any trends, patterns or anomalies in our data.

In [134]:
# Visualize distributions
sns.histplot(Reports_ServiceRequest['Days_to_Complete'], kde=True, bins=20)
plt.title('Days to Complete Graffiti and Waste Removal Requests')
plt.show()
No description has been provided for this image
Observation and Analysis¶
  1. The distribution appears to be highly right skewed. We will need to remove this skewness if we decide to use this feature in our prediction model.
  2. The majority of service requests are completed within the first 50 days of being received.
  3. We can observe a sharp peak around 0-25 days, which means that most service requests are completed and closed within the first three weeks.
  4. Some cases take over 100 days to resolve, sometimes even going up to 500 days.
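One common way to reduce the right skew noted above is a log transform. A minimal sketch with illustrative completion times (not the real dataset): `np.log1p` compresses the long tail while remaining valid for zero-day requests.

```python
import numpy as np
import pandas as pd

# Illustrative right-skewed completion times in days (not the real data)
days = pd.Series([2, 3, 5, 7, 9, 13, 17, 42, 120, 505])

# log1p = log(1 + x): compresses the long right tail, safe for zeros
log_days = np.log1p(days)

print(f"raw skew: {days.skew():.2f}, log skew: {log_days.skew():.2f}")
```

The skewness statistic drops substantially after the transform, which is what a prediction model would benefit from.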
Pedestrian Sensor Locations - Proximity¶

To identify geospatial patterns or clusters in the placement of sensors across the city of Melbourne, we can use a scatter plot to visualize the placement (that is, the Latitude and the Longitude) of all the sensors. This helps us understand whether the sensors are placed in close proximity to each other or located far apart, and lets us confirm that the clustered sensors are all within the city of Melbourne. If we identify an outlier, we can then remove it to make our data more consistent.

In [135]:
sns.scatterplot(x=Pedestrian_SensorLoc['Latitude'], y=Pedestrian_SensorLoc['Longitude'])
plt.title('Pedestrian Sensor Locations')
plt.show()
No description has been provided for this image
Observation and Analysis¶
  1. We can observe a scattered pattern across the geographical Latitude and Longitude co-ordinates.
  2. Some dense clustering can be observed in the centre, which suggests that the sensors are placed in close proximity to each other in the Melbourne city.
  3. Only a few sensors are placed a little further away from the rest, suggesting a nearby suburb.
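The "further away" sensors can be flagged programmatically rather than by eye. A minimal sketch, using illustrative coordinates and an arbitrary distance threshold (both are assumptions, not values from the dataset):

```python
import pandas as pd

# Toy coordinates: a tight CBD cluster plus one distant sensor (illustrative)
sensors = pd.DataFrame({
    'Latitude':  [-37.8138, -37.8149, -37.8191, -37.7894],
    'Longitude': [144.9652, 144.9661, 144.9656, 144.9864],
})

# Distance (in degrees) of each sensor from the median centre of the cluster
centre = sensors[['Latitude', 'Longitude']].median()
dist = ((sensors['Latitude'] - centre['Latitude']) ** 2 +
        (sensors['Longitude'] - centre['Longitude']) ** 2) ** 0.5

# Flag sensors far from the centre as candidate outliers (threshold is arbitrary)
outliers = sensors[dist > 0.02]
print(outliers)
```

The median centre is robust to the outlier itself, which is why it is preferred to the mean here.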
The Number of Graffiti and Waste Removal Service Requests by Each Year¶

Let us plot the number of service requests that have come in over the span of three years (2014 to 2016) using a bar chart. Bar charts are a great tool for comparing the count distribution across categories. Before this, we need to aggregate our data by the year in which the service request was received: we first extract the year from the Date_Received column and then group the data by Year.

In [136]:
# Number of graffiti requests per year
Reports_ServiceRequest['Year'] = Reports_ServiceRequest['Date_Received'].dt.year
requests_by_year = Reports_ServiceRequest.groupby('Year').size()

requests_by_year.plot(kind='bar', figsize=(10, 6), color='skyblue')
plt.title("Graffiti and Waste Dump Removal Requests by Year")
plt.xlabel("Year")
plt.ylabel("Number of Requests")
plt.show()
No description has been provided for this image
Observation and Analysis¶
  1. We can observe a significant increase in the number of service requests made to public authorities over the span of three years (from 2014 to 2016).
  2. The first year, 2014, saw around 300 service requests for graffiti and waste dumping removal.
  3. The next two years (2015 and 2016) saw a huge jump in the number of service requests, reaching about 3,500 requests in the whole of 2016.
  4. This also explains the earlier skewed values for the days taken to complete a service request (which go up to 500 days): the increased number of service requests makes it difficult for the public authorities to attend to every request on time with their limited resources.
Total Captured Pedestrian Count by Each Location ID¶

To understand the density of pedestrian traffic captured by each sensor, we need to compare the total movement observed at each location. For this, we again turn to the bar chart. We first group the data, just as we did above, by each location_id and take the sum of the total_of_directions column values.

In [137]:
# Aggregate pedestrian counts by location
location_counts = Pedestrian_CountsPerMinute.groupby('location_id')['total_of_directions'].sum()

location_counts.plot(kind='bar', figsize=(14, 6), color='green')
plt.title("Total Pedestrian Counts by Location")
plt.xlabel("Location ID")
plt.ylabel("Total Pedestrians")
plt.show()
No description has been provided for this image
Observation and Analysis¶
  1. There is a very wide range of variation in pedestrian traffic across the city of Melbourne.
  2. Most locations have a pedestrian traffic count ranging between 10,000 and 30,000.
  3. Certain locations exceed 50,000 captured pedestrian movements.
  4. The locations with unexpectedly high pedestrian traffic are likely prime locations in the city of Melbourne.
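The busiest locations hinted at above can be surfaced directly with `Series.nlargest` on the aggregated counts. A minimal sketch with illustrative totals (not the real sensor counts):

```python
import pandas as pd

# Toy aggregated counts per location_id (illustrative values)
location_counts = pd.Series(
    {2: 12000, 4: 54000, 6: 28000, 8: 9000, 10: 61000},
    name='total_of_directions')

# nlargest surfaces the busiest sensors for targeted monitoring
top = location_counts.nlargest(2)
print(top)
```

In the real pipeline the same call on the grouped `location_counts` would name the prime locations worth prioritising for hotspot monitoring.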

5. Merging the Data¶

Merging Pedestrian Traffic Data¶

To support a comprehensive analysis, we have to merge the Pedestrian Sensor Location data with the Pedestrian Counts Per Minute dataset using the 'location_id' column that is present in both dataframes. This is a very straightforward way of joining our datasets and is therefore the first step in our merge operations.

We will perform an inner join so that only the location_ids that exist in both datasets are kept. The dataset resulting from this merge provides us with consolidated sensor information such as the location_id, sensor description, traffic data, etc.

In [138]:
# Merge pedestrian location with counts
pedestrian_data = pd.merge(Pedestrian_SensorLoc, Pedestrian_CountsPerMinute, on='location_id', how='inner')
In [139]:
pedestrian_data
Out[139]:
location_id sensor_description sensor_name location_type direction_1_x direction_2_x Latitude Longitude sensing_date sensing_time direction_1_y direction_2_y total_of_directions Sensing_Datetime
0 2 Bourke Street Mall (South) Bou283_T Outdoor East West -37.813807 144.965167 2024-12-12 00:55 0 2 2 2024-12-11 13:55:00+00:00
1 2 Bourke Street Mall (South) Bou283_T Outdoor East West -37.813807 144.965167 2024-12-12 00:59 0 1 1 2024-12-11 13:59:00+00:00
2 2 Bourke Street Mall (South) Bou283_T Outdoor East West -37.813807 144.965167 2024-12-12 01:12 1 0 1 2024-12-11 14:12:00+00:00
3 2 Bourke Street Mall (South) Bou283_T Outdoor East West -37.813807 144.965167 2024-12-12 01:25 0 1 1 2024-12-11 14:25:00+00:00
4 2 Bourke Street Mall (South) Bou283_T Outdoor East West -37.813807 144.965167 2024-12-12 01:29 3 1 4 2024-12-11 14:29:00+00:00
... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
130644 166 484 Spencer Street Spen484_T Outdoor North South -37.808967 144.949317 2024-12-13 17:00 2 7 9 2024-12-13 06:00:00+00:00
130645 166 484 Spencer Street Spen484_T Outdoor North South -37.808967 144.949317 2024-12-13 17:15 3 3 6 2024-12-13 06:15:00+00:00
130646 166 484 Spencer Street Spen484_T Outdoor North South -37.808967 144.949317 2024-12-13 17:30 4 5 9 2024-12-13 06:30:00+00:00
130647 166 484 Spencer Street Spen484_T Outdoor North South -37.808967 144.949317 2024-12-13 17:40 9 8 17 2024-12-13 06:40:00+00:00
130648 166 484 Spencer Street Spen484_T Outdoor North South -37.808967 144.949317 2024-12-13 17:55 6 5 11 2024-12-13 06:55:00+00:00

130649 rows × 14 columns

Aggregating the Pedestrian Traffic Data to Extract the Daily Traffic Density¶

Now that we have a consolidated dataset with all our pedestrian sensor and traffic information, we need to decide what we actually want from it. For our analysis, it is very important to understand the pedestrian traffic density at each location, since we want to figure out how it impacts or relates to the graffiti incidents.

For this, we will aggregate our pedestrian traffic data to get the daily traffic count for each location and each day. We extract the date from our datetime column and rename the total_of_directions column. The resulting dataset provides the daily traffic for each sensor location and date.

In [140]:
# Extract the sensing date for grouping
pedestrian_data['sensing_date'] = pedestrian_data['Sensing_Datetime'].dt.date

# Group by sensor and date, summing up pedestrian counts
daily_aggregated_data = pedestrian_data.groupby(
    ['location_id', 'sensor_description', 'sensor_name', 'Latitude', 'Longitude', 'sensing_date']
).agg({'total_of_directions': 'sum'}).reset_index()

# Rename column for clarity
daily_aggregated_data.rename(columns={'total_of_directions': 'daily_traffic'}, inplace=True)

# Display the result
daily_aggregated_data.head()
Out[140]:
location_id sensor_description sensor_name Latitude Longitude sensing_date daily_traffic
0 1 Bourke Street Mall (North) Bou292_T -37.813494 144.965153 2024-12-11 3423
1 1 Bourke Street Mall (North) Bou292_T -37.813494 144.965153 2024-12-12 35198
2 1 Bourke Street Mall (North) Bou292_T -37.813494 144.965153 2024-12-13 22523
3 2 Bourke Street Mall (South) Bou283_T -37.813807 144.965167 2024-12-11 2073
4 2 Bourke Street Mall (South) Bou283_T -37.813807 144.965167 2024-12-12 18808

Merging Customer Service Request Data¶

Now that our pedestrian traffic information is ready, we need to figure out how to merge it with the service request dataset. When we analyzed this dataset, we observed that, unlike the pedestrian traffic data, it does not have a 'location_id' column.

However, we do have a column which stores the name of the suburb from which each request came. If we think about it, this can be the key column that paves the way for merging the data. How?

For this, we will use the Nominatim geolocator to fetch the Latitude and Longitude of each suburb in our dataset, adding a delay to comply with the geolocation service's rate limits. Without a delay, the continuous API calls made from our code would fail. We store the Latitude and Longitude values in a dictionary keyed by unique suburb, and then map this information back to our original dataset using a lambda function.

Geolocation¶
In [141]:
from geopy.geocoders import Nominatim
import time

geolocator = Nominatim(user_agent="my_geocoder")

def get_coordinates_with_delay(suburb):
    try:
        time.sleep(1)  # Add delay to respect rate limits
        location = geolocator.geocode(f"{suburb}, Melbourne, Australia", timeout=10)
        if location:
            return location.latitude, location.longitude
        else:
            return None, None
    except Exception as e:
        print(f"Error fetching coordinates for {suburb}: {e}")
        return None, None

# Apply function to unique suburbs
unique_suburbs = Reports_ServiceRequest['suburb'].dropna().unique()
suburb_coordinates = {suburb: get_coordinates_with_delay(suburb) for suburb in unique_suburbs}

# Map coordinates back to the dataset
Reports_ServiceRequest.loc[:, 'Latitude'] = Reports_ServiceRequest['suburb'].map(lambda x: suburb_coordinates.get(x, (None, None))[0])
Reports_ServiceRequest.loc[:, 'Longitude'] = Reports_ServiceRequest['suburb'].map(lambda x: suburb_coordinates.get(x, (None, None))[1])

Let's have a look at the dataset now to check if the Latitude and Longitude values are correctly populated

In [142]:
Reports_ServiceRequest.head()
Out[142]:
request_status date_received date_completed suburb category service_desc days_to_complete Date_Received Date_Completed Days_to_Complete Year Latitude Longitude
0 CLOSED 2016-05-05 2016-05-12 Carlton Graffiti Graffiti Removal 7 2016-05-05 2016-05-12 7.0 2016 -37.800423 144.968434
5 CLOSED 2014-12-01 2014-12-08 Carlton Graffiti Graffiti Removal 7 2014-12-01 2014-12-08 7.0 2014 -37.800423 144.968434
9 CLOSED 2014-12-06 2014-12-15 Parkville Graffiti Graffiti Removal 9 2014-12-06 2014-12-15 9.0 2014 -37.787115 144.951553
10 CLOSED 2014-12-11 2015-01-22 Melbourne Graffiti Graffiti Removal 42 2014-12-11 2015-01-22 42.0 2014 -37.814245 144.963173
12 CLOSED 2014-12-17 2015-01-08 Melbourne Graffiti Graffiti Removal 22 2014-12-17 2015-01-08 22.0 2014 -37.814245 144.963173

Since the Latitude and Longitude values are essential for joining the data, we need to ensure that geolocation did not fail for any suburb and that every service request is populated with coordinates for its suburb. To do this, we simply use the isnull() function to check for any missing values and print them.

In [146]:
# Check for missing coordinates
missing_coords = Reports_ServiceRequest[Reports_ServiceRequest['Latitude'].isnull() | Reports_ServiceRequest['Longitude'].isnull()]
if not missing_coords.empty:
    print("Missing coordinates for the following suburbs:")
    print(missing_coords['suburb'].unique())

missing_coords
Out[146]:
request_status date_received date_completed suburb category service_desc days_to_complete Date_Received Date_Completed Days_to_Complete Year Latitude Longitude

As we can see, the geolocation process was successful for every record in our Service Request dataset. This is a good sign, because dropping null rows would have shrunk our dataset further.

Calculating the Distance using Haversine Formula¶

Now that we have the Latitude and Longitude values for each suburb, we need to use this information to join our pedestrian traffic and service request data. But we still don't have a location_id. How do we use the geospatial information to fetch the correct location_id? Would the location of each sensor exactly match the origin of each service request?

The answer is a simple No.

To navigate this issue, we use the Haversine formula, which calculates the distance between any two points on the Earth's surface from their Latitude and Longitude values. This distance is calculated in kilometres and can be used to perform a proximity analysis.

In [147]:
import numpy as np

# Haversine formula to calculate the distance in kilometers
def haversine(lat1, lon1, lat2, lon2):
    R = 6371  # Earth radius in kilometers
    phi1 = np.radians(lat1)
    phi2 = np.radians(lat2)
    delta_phi = np.radians(lat2 - lat1)
    delta_lambda = np.radians(lon2 - lon1)

    a = np.sin(delta_phi / 2) ** 2 + np.cos(phi1) * np.cos(phi2) * np.sin(delta_lambda / 2) ** 2
    return R * 2 * np.arctan2(np.sqrt(a), np.sqrt(1 - a))
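As a quick sanity check on the formula, we can verify two known facts: a point is at distance zero from itself, and one degree of latitude spans roughly 111.2 km. This self-contained sketch repeats the function defined above:

```python
import numpy as np

# Same formula as above, repeated here so the check is self-contained
def haversine(lat1, lon1, lat2, lon2):
    R = 6371  # Earth radius in kilometers
    phi1, phi2 = np.radians(lat1), np.radians(lat2)
    delta_phi = np.radians(lat2 - lat1)
    delta_lambda = np.radians(lon2 - lon1)
    a = np.sin(delta_phi / 2) ** 2 + np.cos(phi1) * np.cos(phi2) * np.sin(delta_lambda / 2) ** 2
    return R * 2 * np.arctan2(np.sqrt(a), np.sqrt(1 - a))

# A point is at distance zero from itself
assert haversine(-37.8136, 144.9631, -37.8136, 144.9631) == 0

# One degree of latitude is roughly 111.2 km anywhere on the globe
d = haversine(-37.0, 144.9631, -38.0, 144.9631)
print(round(d, 1))  # ≈ 111.2
```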
Proximity Analysis to Find the Closest Sensor¶

After using the Haversine method to calculate the distance between the origin of each service request and the sensors, we can perform a proximity analysis to identify the closest sensor for each request. We define a function that finds the closest pedestrian sensor for each request and stores its 'location_id' and description in a new DataFrame. This proximity data is then merged back into our original dataset.

Therefore, after performing these operations, we will have a dataset that contains all Graffiti and Waste removal service requests tagged with the closest sensor. We can use this to join our datasets. Voila!

This will help us identify which areas have historically experienced the most cases and how many pedestrians pass through those areas on a daily basis.

In [148]:
# Add a column for the closest pedestrian sensor for each service request
def find_closest_pedestrian(row, pedestrian_data):
    distances = pedestrian_data.apply(
        lambda x: haversine(row['Latitude'], row['Longitude'], x['Latitude'], x['Longitude']), axis=1
    )
    closest_sensor = pedestrian_data.loc[distances.idxmin()]
    return pd.Series({
        'closest_location_id': closest_sensor['location_id'],
        'closest_sensor_name': closest_sensor['sensor_name'],
        'closest_sensor_distance': distances.min()
    })

# Apply to service request data
service_requests_with_proximity = Reports_ServiceRequest.apply(
    find_closest_pedestrian, axis=1, pedestrian_data=daily_aggregated_data
)

# Merge the proximity data back into the original service request dataset
Reports_ServiceRequest = pd.concat([Reports_ServiceRequest, service_requests_with_proximity], axis=1)
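Row-wise apply over every request-sensor pair scales as O(requests × sensors) and can be slow on larger datasets. A possible vectorized alternative, sketched here on hypothetical toy coordinates, computes the full distance matrix with NumPy broadcasting and takes the argmin per row:

```python
import numpy as np

# Toy coordinates: 3 service requests and 2 sensors (illustrative values only)
req = np.array([[-37.800, 144.968],
                [-37.787, 144.952],
                [-37.814, 144.963]])
sensors = np.array([[-37.8135, 144.9651],   # sensor 0
                    [-37.7871, 144.9516]])  # sensor 1

def haversine_matrix(points_a, points_b):
    """Pairwise haversine distances (km) via broadcasting: shape (len(a), len(b))."""
    R = 6371
    lat_a, lon_a = np.radians(points_a).T
    lat_b, lon_b = np.radians(points_b).T
    dlat = lat_a[:, None] - lat_b[None, :]
    dlon = lon_a[:, None] - lon_b[None, :]
    a = np.sin(dlat / 2) ** 2 + np.cos(lat_a)[:, None] * np.cos(lat_b)[None, :] * np.sin(dlon / 2) ** 2
    return R * 2 * np.arctan2(np.sqrt(a), np.sqrt(1 - a))

dist = haversine_matrix(req, sensors)
closest = dist.argmin(axis=1)  # index of the nearest sensor for each request
print(closest)
```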

Let's have a look and see how our approach has worked out. The closest_location_id must be populated for each service request.

In [149]:
Reports_ServiceRequest.head()
Out[149]:
request_status date_received date_completed suburb category service_desc days_to_complete Date_Received Date_Completed Days_to_Complete Year Latitude Longitude closest_location_id closest_sensor_name closest_sensor_distance
0 CLOSED 2016-05-05 2016-05-12 Carlton Graffiti Graffiti Removal 7 2016-05-05 2016-05-12 7.0 2016 -37.800423 144.968434 37 Lyg260_T 0.141728
5 CLOSED 2014-12-01 2014-12-08 Carlton Graffiti Graffiti Removal 7 2014-12-01 2014-12-08 7.0 2014 -37.800423 144.968434 37 Lyg260_T 0.141728
9 CLOSED 2014-12-06 2014-12-15 Parkville Graffiti Graffiti Removal 9 2014-12-06 2014-12-15 9.0 2014 -37.787115 144.951553 44 UM3_T 1.575443
10 CLOSED 2014-12-11 2015-01-22 Melbourne Graffiti Graffiti Removal 42 2014-12-11 2015-01-22 42.0 2014 -37.814245 144.963173 2 Bou283_T 0.181827
12 CLOSED 2014-12-17 2015-01-08 Melbourne Graffiti Graffiti Removal 22 2014-12-17 2015-01-08 22.0 2014 -37.814245 144.963173 2 Bou283_T 0.181827

Let us also check the maximum distance between a service request and its closest pedestrian sensor. We want to make sure that each service request is tagged to the correct and nearest sensor, and that the distance does not exceed a 3 km radius. We'll also remove the Latitude and Longitude columns, since we only require the closest_location_id.

In [152]:
Reports_ServiceRequest = Reports_ServiceRequest.drop(columns=['Latitude','Longitude'])
Reports_ServiceRequest['closest_sensor_distance'].max()
Out[152]:
2.684795949325036
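Here the maximum (≈2.68 km) already falls inside the 3 km radius, so no rows need removing. If future data did exceed the limit, a defensive filter along these lines (illustrative data only) would drop out-of-range requests:

```python
import pandas as pd

# Illustrative data: one request lies well beyond a 3 km matching radius
df = pd.DataFrame({
    'request_id': [1, 2, 3],
    'closest_sensor_distance': [0.14, 2.68, 7.50],
})

MAX_RADIUS_KM = 3.0
within = df[df['closest_sensor_distance'] <= MAX_RADIUS_KM]
print(len(within))  # 2 of the 3 requests survive the filter
```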
Aggregating the Number of Service Requests For Each Location¶

We now need to aggregate our Service Request dataset by the closest_location_id before we can merge it with the pedestrian traffic data. The reason is that we want to understand which locations have experienced a high number of historically recorded cases of graffiti vandalism or illegal waste dumping.

We do this using the groupby function and rename the columns appropriately for a smooth merge. We count the total number of cases for each location and store it along with the location_id in a new DataFrame. This will make it easier for us to identify any hotspots.

In [153]:
# Grouping data by 'closest_location_id' and counting occurrences
location_total_cases = Reports_ServiceRequest.groupby('closest_location_id').size().reset_index(name='count')

# Renaming column for clarity (optional)
location_total_cases.rename(columns={'closest_location_id': 'location_id'}, inplace=True)

# Preview the grouped data
location_total_cases.head()
Out[153]:
location_id count
0 2 2320
1 8 6
2 11 517
3 37 1041
4 44 645
Merging with Aggregated Pedestrian Traffic Data¶

Now the only thing left is merging our Service Request data into the aggregated pedestrian traffic data. We'll do that using the Pandas merge() function, replacing any null values with zero to denote that no cases of graffiti vandalism or waste dumping have been reported at that location in the past, and renaming the columns so they make sense.

In [154]:
# Merge the grouped location counts with daily aggregated data
tmo_agg = pd.merge(daily_aggregated_data, location_total_cases, on='location_id', how='left')

# Imputing NA values
tmo_agg['count'] = tmo_agg['count'].fillna(0)

# Rename the count column to total_cases in the merged dataframe
tmo_agg.rename(columns={'count': 'total_cases'}, inplace=True)
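When auditing a left join like this one, pandas' `indicator=True` option can be handy: it tags each row as 'both' or 'left_only', making it obvious which sensors had no recorded cases before we impute zeros. A minimal sketch on toy frames:

```python
import pandas as pd

# Toy frames mirroring the merge above: sensors on the left, case counts on the right
sensors = pd.DataFrame({'location_id': [1, 2, 3]})
cases = pd.DataFrame({'location_id': [2], 'count': [2320]})

merged = pd.merge(sensors, cases, on='location_id', how='left', indicator=True)
print((merged['_merge'] == 'left_only').sum())  # sensors with no recorded cases

# Rows tagged 'left_only' had no recorded cases; fill them with zero as above
merged['count'] = merged['count'].fillna(0)
```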

Checking the Merged Aggregated DataFrame

In [155]:
tmo_agg.head()
Out[155]:
location_id sensor_description sensor_name Latitude Longitude sensing_date daily_traffic total_cases
0 1 Bourke Street Mall (North) Bou292_T -37.813494 144.965153 2024-12-11 3423 0.0
1 1 Bourke Street Mall (North) Bou292_T -37.813494 144.965153 2024-12-12 35198 0.0
2 1 Bourke Street Mall (North) Bou292_T -37.813494 144.965153 2024-12-13 22523 0.0
3 2 Bourke Street Mall (South) Bou283_T -37.813807 144.965167 2024-12-11 2073 2320.0
4 2 Bourke Street Mall (South) Bou283_T -37.813807 144.965167 2024-12-12 18808 2320.0

Merging Developmental Activity Model Footprint Dataset¶

Now that we have successfully merged three of our datasets, we are only left with one.

This dataset is well maintained and has all the necessary information we require except the 'location_id' column. However, unlike the Service Request dataset, the development activity footprint contains uniquely identified developmental projects along with their addresses, and most importantly, Latitudes and Longitudes.

We have done this before, so using the Latitude and Longitude values to find the closest pedestrian sensor is nothing to worry about. The function to calculate distances using the Haversine formula already exists, and so does our function to find the closest pedestrian sensor. Hence, we will reuse them to tag each developmental project with the closest 'location_id'.

In [156]:
# Apply the proximity calculation to the new dataset
developmental_with_proximity = Development_Activity_Footprint.apply(
    find_closest_pedestrian, axis=1, pedestrian_data=daily_aggregated_data
)

# Merge the proximity data back into the original developmental activity dataset
Development_Activity_Footprint = pd.concat([Development_Activity_Footprint, developmental_with_proximity], axis=1)

# Preview the results
Development_Activity_Footprint.head()
Out[156]:
dev_key status permit_num bldhgt_ahd base_ahd address num_floors land_use_1 land_use_2 land_use_3 shape_type datadate Latitude Longitude closest_location_id closest_sensor_name closest_sensor_distance
0 X000931 COMPLETED TPM-2013-4 321.0 2.3 68-82 Southbank Boulevard SOUTHBANK VIC 3006 99 Dwellings 1060 Car parks 600 Retail (sq.m) 659 tower 2024-10-24T12:53:51+00:00 -37.823159 144.963960 35 SouthB_T 0.346005
1 X001215 COMPLETED TPM-2010-31 31.6 1.9 312-320 City Road SOUTHBANK VIC 3006 43 Dwellings 494 Car parks 180 Retail (sq.m) 276 podium 2024-10-24T12:53:51+00:00 -37.827113 144.958224 140 Boyd2837_T 0.346208
2 X000930 COMPLETED TPM-2014-42 30.9 1.3 54-68 Kavagh Street SOUTHBANK VIC 3006 51 Dwellings 597 Car parks 417 Retail (sq.m) 557 podium 2024-10-24T12:53:51+00:00 -37.824092 144.963729 140 Boyd2837_T 0.260416
3 X000930 COMPLETED TPM-2014-42 157.8 1.3 54-68 Kavagh Street SOUTHBANK VIC 3006 51 Dwellings 597 Car parks 417 Retail (sq.m) 557 tower 2024-10-24T12:53:51+00:00 -37.824162 144.963709 140 Boyd2837_T 0.253296
4 X000755 COMPLETED TPM-2014-33 21.6 3.0 68-70 Dorcas Street SOUTHBANK VIC 3006 28 Dwellings 238 Car parks 193 Retail (sq.m) 240 podium 2024-10-24T12:53:51+00:00 -37.830633 144.967467 140 Boyd2837_T 0.719933

The results look good!

Let us now drop the Latitude and Longitude columns, because we won't be needing them anymore, and verify that the maximum distance to the closest sensor does not exceed the limit of a 5 km radius, since we only want to consider nearby development as a factor.

In [157]:
Development_Activity_Footprint = Development_Activity_Footprint.drop(columns=['Latitude','Longitude'])
Development_Activity_Footprint['closest_sensor_distance'].max()
Out[157]:
3.225977905179067

By this time, we all know what to do next. We'll aggregate the developmental activity footprint so that we get a count of all the developmental projects completed or ongoing at each location. Since each project is uniquely identified by the 'dev_key', we will use it to take our count and store the results in a new DataFrame.

In [158]:
# Group by 'closest_location_id' and calculate total developmental activities
location_summary = Development_Activity_Footprint.groupby('closest_location_id').agg(
    total_developmental_activity=('dev_key', 'count')
).reset_index()


# Preview the summarized data
location_summary.head()
Out[158]:
closest_location_id total_developmental_activity
0 2 4
1 3 3
2 4 1
3 8 5
4 9 19
Performing the final merge¶

Now we just need to merge our final dataset into our existing aggregated dataset. We'll be using the same 'location_id' to again merge these DataFrames.

The resulting DataFrame will be a consolidated dataset that tells us everything about the pedestrian sensors, including all the important factors we want for this analysis and prediction: the location ID, the daily pedestrian traffic each sensor observes, the historical number of cases reported near that location, and finally the number of construction or developmental projects completed or in progress nearby.

Before doing this, we will need to rename the closest_location_id column appropriately.

In [160]:
# Rename closest_location_id to location_id so the merge keys match
location_summary.rename(columns={'closest_location_id': 'location_id'}, inplace=True)

tmo_agg = pd.merge(
    tmo_agg, 
    location_summary, 
    on='location_id', 
    how='left'
)

# Imputing NA values
tmo_agg['total_developmental_activity'] = tmo_agg['total_developmental_activity'].fillna(0)

# Preview the merged dataset
tmo_agg.head()
Out[160]:
location_id sensor_description sensor_name Latitude Longitude sensing_date daily_traffic total_cases total_developmental_activity
0 1 Bourke Street Mall (North) Bou292_T -37.813494 144.965153 2024-12-11 3423 0.0 0.0
1 1 Bourke Street Mall (North) Bou292_T -37.813494 144.965153 2024-12-12 35198 0.0 0.0
2 1 Bourke Street Mall (North) Bou292_T -37.813494 144.965153 2024-12-13 22523 0.0 0.0
3 2 Bourke Street Mall (South) Bou283_T -37.813807 144.965167 2024-12-11 2073 2320.0 4.0
4 2 Bourke Street Mall (South) Bou283_T -37.813807 144.965167 2024-12-12 18808 2320.0 4.0
Analyzing the Distribution of Number of Cases¶

Let us use a histplot to plot the distribution of the total number of cases per location ID. This will help us analyze the frequency of such incidents and what to expect in the future.

In [161]:
plt.figure(figsize=(10, 6))
sns.histplot(tmo_agg['total_cases'], bins=20, kde=True, color='blue')

# Customize the plot
plt.title('Distribution of Total Cases', fontsize=16)
plt.xlabel('Total Cases', fontsize=14)
plt.ylabel('Frequency', fontsize=14)
plt.grid(axis='y', linestyle='--', alpha=0.7)

# Show the plot
plt.show()
No description has been provided for this image
Observation and Analysis¶
  1. We can observe that the distribution is right-skewed.
  2. Most locations have historically experienced around 0-100 recorded cases.
  3. There is a long tail of locations experiencing an unusually high number of service requests, reaching up to 2,500.

Let us now use a bar chart to analyze the same information for each location ID.

In [162]:
# Sort data by total_cases for better visualization (optional)
tmo_agg = tmo_agg.sort_values(by='total_cases', ascending=False)

# Plot a bar chart
plt.figure(figsize=(12, 6))
sns.barplot(x='location_id', y='total_cases', data=tmo_agg, palette='viridis')

# Customize the plot
plt.title('Total Cases by Location ID', fontsize=16)
plt.xlabel('Location ID', fontsize=14)
plt.ylabel('Total Cases', fontsize=14)
plt.xticks(rotation=90)  # Rotate x-axis labels if there are many location IDs
plt.grid(axis='y', linestyle='--', alpha=0.7)

# Show the plot
plt.tight_layout()  # Adjust layout for better spacing
plt.show()
C:\Users\natus\AppData\Local\Temp\ipykernel_8764\3206466295.py:6: FutureWarning: 

Passing `palette` without assigning `hue` is deprecated and will be removed in v0.14.0. Assign the `x` variable to `hue` and set `legend=False` for the same effect.

  sns.barplot(x='location_id', y='total_cases', data=tmo_agg, palette='viridis')
No description has been provided for this image
Observation and Analysis¶
  1. The bar chart above shows the total number of cases for each location.
  2. We can observe that only a few locations show a very high number of recorded service requests, ranging between 500 and 2,500 over the span of three years.
  3. The rest of the locations experience only a negligible number of service requests in comparison.
  4. The locations with a high number of cases appear to be existing hotspots.

Geospatial Analysis¶

Mapping the Pedestrian Sensor Locations¶

In [23]:
import folium
from folium.plugins import MarkerCluster
import pandas as pd

# Load daily aggregated data (example data provided above)
# Assuming the data is stored in a DataFrame named `daily_aggregated_data`

# Drop duplicates to get unique sensor locations
sensor_locations = daily_aggregated_data[['location_id', 'sensor_description', 'Latitude', 'Longitude']].drop_duplicates()

# Initialize a folium map centered on Melbourne
melbourne_map = folium.Map(location=[-37.8136, 144.9631], zoom_start=12)

# Add sensor locations to the map using a marker cluster
marker_cluster = MarkerCluster().add_to(melbourne_map)

for _, row in sensor_locations.iterrows():
    folium.Marker(
        location=[row['Latitude'], row['Longitude']],
        popup=f"Sensor: {row['sensor_description']} (ID: {row['location_id']})"
    ).add_to(marker_cluster)

# Display map
melbourne_map
Out[23]:
Make this Notebook Trusted to load map: File -> Trust Notebook
What can you see from this map?¶
  1. The blue markers across the Melbourne CBD grid are concentrated along Bourke Street, Southern Cross, Lonsdale Street and Flinders Street.
  2. Clicking on any blue marker pops up information about that sensor's location and its location ID.
  3. We have used the clustering feature to group high concentrations of sensors in certain areas into a single cluster showing the number of sensors within it.
  4. You can see the distribution and placement of each sensor and hence develop an understanding of where to expect high pedestrian traffic.

Analysing the Daily Traffic by Pedestrian Sensor Location¶

In [24]:
# Aggregate traffic data by location
traffic_data = daily_aggregated_data.groupby(['location_id', 'sensor_description', 'Latitude', 'Longitude'])['daily_traffic'].sum().reset_index()

# Initialize a folium map
traffic_map = folium.Map(location=[-37.8136, 144.9631], zoom_start=12)

# Add traffic-based circle markers
for _, row in traffic_data.iterrows():
    folium.CircleMarker(
        location=[row['Latitude'], row['Longitude']],
        radius=row['daily_traffic'] / 1000,  # Scale radius for better visualization
        color='blue',
        fill=True,
        fill_opacity=0.6,
        popup=f"Sensor: {row['sensor_description']}<br>Traffic: {row['daily_traffic']}"
    ).add_to(traffic_map)

# Display map
traffic_map
Out[24]:
Make this Notebook Trusted to load map: File -> Trust Notebook
What can you see from this map?¶
  1. Larger blue circles indicate higher pedestrian traffic, observed specifically between Flinders Street and the Melbourne Central area.
  2. The smaller circles indicate comparatively low pedestrian traffic.
  3. Pedestrian footfall is seemingly low along the edges of the CBD and near Marvel Stadium.
  4. Clicking on any circle shows the daily traffic at that particular location.

Analyzing the Number of Recorded Graffiti Vandalism or Waste Dumping Cases by Location¶

In [31]:
import folium
import pandas as pd

# Aggregate the number of requests by suburb and calculate average latitude and longitude
suburb_requests = Reports_ServiceRequest.groupby('suburb').agg({
    'Latitude': 'mean',
    'Longitude': 'mean',
    'service_desc': 'count'  # Counting the number of service requests
}).rename(columns={'service_desc': 'request_count'}).reset_index()

# Initialize a folium map centered on Melbourne
suburb_map = folium.Map(location=[-37.8136, 144.9631], zoom_start=12)

# Add circle markers for each suburb based on the number of requests
for _, row in suburb_requests.iterrows():
    folium.CircleMarker(
        location=[row['Latitude'], row['Longitude']],
        radius=row['request_count'] / 50,  # Scale the radius for better visualization
        color='red',
        fill=True,
        fill_opacity=0.7,
        popup=f"Suburb: {row['suburb']}<br>Requests: {row['request_count']}"
    ).add_to(suburb_map)

# Display map
suburb_map
Out[31]:
Make this Notebook Trusted to load map: File -> Trust Notebook
What can you see from the map?¶
  1. The size of the red circles is directly proportional to the number of service requests received by the City Council.
  2. We can observe a large red circle around central Melbourne, indicating the highest number of service requests made.
  3. Several suburbs surrounding the Melbourne CBD, such as Carlton North and West Melbourne, have also reported a considerable number of service requests.
  4. We can observe records of service requests from a wider geographical area, including Kensington, Flemington and Richmond.
  5. Docklands and Southbank experience a surprisingly low number of service requests.
  6. Aggregation makes it easier to identify which suburbs record the most service requests, which ultimately provides us with valuable insight.
  7. Clicking on a circle shows the total number of recorded cases in that suburb.

For our next visualization - mapping hotspots - we need to combine the Service Request data and the daily aggregated pedestrian traffic data.

We will merge them using the already derived location IDs and drop all null values, so that we get a dataset with consolidated information on each pedestrian sensor, its location and the total number of nearby cases.

In [83]:
graffiti_with_pedestrian = pd.merge(
    daily_aggregated_data,
    Reports_ServiceRequest,
    right_on='closest_location_id',
    left_on='location_id',
    how='left'
)

graffiti_with_pedestrian = graffiti_with_pedestrian.dropna()

graffiti_with_pedestrian.head()
Out[83]:
location_id sensor_description sensor_name Latitude Longitude sensing_date daily_traffic request_status date_received date_completed ... category service_desc days_to_complete Date_Received Date_Completed Days_to_Complete Year closest_location_id closest_sensor_name closest_sensor_distance
3 2 Bourke Street Mall (South) Bou283_T -37.813807 144.965167 2024-12-11 2073 CLOSED 2014-12-11 2015-01-22 ... Graffiti Graffiti Removal 42.0 2014-12-11 2015-01-22 42.0 2014.0 2.0 Bou283_T 0.181827
4 2 Bourke Street Mall (South) Bou283_T -37.813807 144.965167 2024-12-11 2073 CLOSED 2014-12-17 2015-01-08 ... Graffiti Graffiti Removal 22.0 2014-12-17 2015-01-08 22.0 2014.0 2.0 Bou283_T 0.181827
5 2 Bourke Street Mall (South) Bou283_T -37.813807 144.965167 2024-12-11 2073 CLOSED 2014-12-31 2015-01-13 ... Graffiti Graffiti Removal 13.0 2014-12-31 2015-01-13 13.0 2014.0 2.0 Bou283_T 0.181827
6 2 Bourke Street Mall (South) Bou283_T -37.813807 144.965167 2024-12-11 2073 CLOSED 2015-01-07 2015-01-20 ... Graffiti Graffiti Removal 13.0 2015-01-07 2015-01-20 13.0 2015.0 2.0 Bou283_T 0.181827
7 2 Bourke Street Mall (South) Bou283_T -37.813807 144.965167 2024-12-11 2073 CLOSED 2015-05-03 2015-05-06 ... Graffiti Graffiti Removal 3.0 2015-05-03 2015-05-06 3.0 2015.0 2.0 Bou283_T 0.181827

5 rows × 21 columns

Mapping Graffiti Vandalism and Waste Dumping Hotspots Across the City of Melbourne¶

In [85]:
import folium
from folium.plugins import MarkerCluster

# Initialize a folium map centered on Melbourne
map_graffiti_sensors = folium.Map(location=[-37.8136, 144.9631], zoom_start=13)

# Create a marker cluster for better visualization of overlapping markers
marker_cluster = MarkerCluster().add_to(map_graffiti_sensors)

# Iterate through the dataset and add markers for graffiti incidents and their nearest sensors
for _, row in graffiti_with_pedestrian.iterrows():
    # Add marker for the graffiti incident
    folium.Marker(
        location=[row['Latitude'], row['Longitude']],
        popup=(
            f"<b>Suburb:</b> {row['suburb']}<br>"
            f"<b>Category:</b> {row['category']}<br>"
            f"<b>Service Description:</b> {row['service_desc']}<br>"
            f"<b>Days to Complete:</b> {row['days_to_complete']} days<br>"
            f"<b>Nearest Sensor:</b> {row['closest_sensor_name']}<br>"
            f"<b>Distance to Sensor:</b> {row['closest_sensor_distance']:.2f} km"
        ),
        icon=folium.Icon(color="blue", icon="info-sign"),
    ).add_to(marker_cluster)

    # Add circle marker for the nearest sensor with traffic count as radius
    folium.CircleMarker(
        location=[row['Latitude'], row['Longitude']],  # Sensor coordinates
        radius=row['daily_traffic'] / 100,  # Scale traffic for visualization
        color='red',
        fill=True,
        fill_opacity=0.6,
        popup=(
            f"<b>Sensor Name:</b> {row['sensor_name']}<br>"
            f"<b>Traffic Count:</b> {row['daily_traffic']}<br>"
            f"<b>Sensor Description:</b> {row['sensor_description']}"
        ),
    ).add_to(marker_cluster)

# Save the map to an HTML file or display inline
map_graffiti_sensors.save("graffiti_incidents_with_traffic.html")
map_graffiti_sensors
Out[85]:
Make this Notebook Trusted to load map: File -> Trust Notebook
What can you see from the map?¶
  1. This map combines the Service Requests information along with the sensor location information.
  2. The Folium's MarkerCluster plugin helps us incorporate blue markers for Graffiti incidents and red circle markers for Pedestrian Traffic Sensors.
  3. The service requests have been tagged to the closest sensors based on a proximity analysis.
  4. We can observe a dense clustering of the blue markers in Melbourne's CBD, particularly Southbank.
  5. We can gather all information by clicking on the blue markers. It tells us about the type of request made, the nearest sensor description, distance to the closest sensor, etc.

Predictive Modelling¶

We have successfully analyzed our aggregated data geospatially to map graffiti vandalism and waste dumping hotspots across the city of Melbourne.

Now we come to the second part of this Use Case - Prediction. We will now feed this data to a model so that it can learn and predict possible hotspots in the future.

Grouping the Daily Sensing Traffic Data¶

We know that we have sensing data for the past 3 days, so we want to group the data by location_id alone, so that we don't end up with three different daily traffic values per location. We will use the aggregation function to do this, taking the mean (average) daily traffic for each location_id.

In [68]:
# Group the data by 'location_id' and aggregate
grouped_agg_data = tmo_agg.groupby(['location_id']).agg({
    'total_cases': 'first',  # Same value for every row of a location
    'total_developmental_activity': 'first',  # Same value for every row of a location
    'daily_traffic': 'mean'  # Average the traffic over the three sensing days
}).reset_index()

# Display the grouped data
grouped_agg_data
Out[68]:
location_id total_cases total_developmental_activity daily_traffic
0 1 0.0 0.0 17053.666667
1 2 2320.0 4.0 9631.666667
2 3 0.0 3.0 34254.000000
3 4 0.0 1.0 20315.666667
4 5 0.0 0.0 13801.666667
... ... ... ... ...
86 161 0.0 0.0 1852.333333
87 162 0.0 8.0 6463.000000
88 164 0.0 18.0 12368.000000
89 165 0.0 43.0 688.333333
90 166 650.0 30.0 667.000000

91 rows × 4 columns

Labeling¶

For our predictive model, we need a target for the prediction. In this use case, we want to predict whether a particular location is a possible vandalism hotspot or not.

Therefore, the next step is labelling our data as either a hotspot or a non-hotspot. But how do we decide which locations should be hotspots?

To resolve this dilemma, we will use the median value of the total_cases column, which contains the historical total number of cases for each location. This median will serve as a threshold: any location with more cases than the threshold is a hotspot. Choosing the median rather than the mean is important, since outliers do not affect the median as much as they affect the mean. Hence, we will have a sensible, robust threshold.

We will use 0 and 1 encoding to label the data for convenience.

In [88]:
median_cases = grouped_agg_data['total_cases'].median()
# Create the hotspot label based only on total cases
grouped_agg_data['is_hotspot'] = (grouped_agg_data['total_cases'] > median_cases).astype(int)

grouped_agg_data
Out[88]:
location_id total_cases total_developmental_activity daily_traffic is_hotspot
0 1 0.0 0.0 17053.666667 0
1 2 2320.0 4.0 9631.666667 1
2 3 0.0 3.0 34254.000000 0
3 4 0.0 1.0 20315.666667 0
4 5 0.0 0.0 13801.666667 0
... ... ... ... ... ...
86 161 0.0 0.0 1852.333333 0
87 162 0.0 8.0 6463.000000 0
88 164 0.0 18.0 12368.000000 0
89 165 0.0 43.0 688.333333 0
90 166 650.0 30.0 667.000000 1

91 rows × 5 columns
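The robustness argument for the median can be seen with a tiny numeric example (hypothetical counts): a single extreme location drags the mean far above most values, while the median stays put.

```python
import numpy as np

# Case counts for five hypothetical locations; one extreme hotspot dominates
cases = np.array([0, 5, 10, 20, 2320])

print(np.mean(cases))    # 471.0 — dragged up by the single outlier
print(np.median(cases))  # 10.0 — unaffected by it

# A mean threshold would label only the outlier a hotspot;
# the median threshold also flags moderately affected locations
assert (cases > np.mean(cases)).sum() == 1
assert (cases > np.median(cases)).sum() == 2
```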

Splitting and Scaling the Data¶

We will segregate the features from the target and divide our data into four different datasets. Two of them will serve as training data, while the other two will be used to test our model and predict hotspots. We will use the train_test_split function to do this, dividing our data so that 60% remains available for training the model and the other 40% is kept for testing.

Scaling is also important before we apply a model. It ensures that all the features are transformed to the same scale, so that no single feature dominates purely because of its units.

In [97]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Features: location_id, total_cases, total_developmental_activity and daily_traffic
X = grouped_agg_data[['location_id', 'total_cases', 'total_developmental_activity', 'daily_traffic']]

# Target variable: 'is_hotspot'
y = grouped_agg_data['is_hotspot']

# Split the data into training and testing sets (60% training, 40% testing)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=42)

# Standardize the features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

Applying the Model and Predicting Hotspots¶

The last step is to fit the model to our data and then make predictions on the test set. We will use the accuracy and precision scores to judge the performance of our model and the success of our use case.

In [163]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix

# Initialize the Logistic Regression model
model = LogisticRegression()

# Train the model
model.fit(X_train_scaled, y_train)

# Make predictions
y_pred = model.predict(X_test_scaled)

# Evaluate the model
print("\nClassification Report:\n", classification_report(y_test, y_pred))
Classification Report:
               precision    recall  f1-score   support

           0       1.00      0.97      0.99        35
           1       0.67      1.00      0.80         2

    accuracy                           0.97        37
   macro avg       0.83      0.99      0.89        37
weighted avg       0.98      0.97      0.98        37
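Note that the test set contains only 2 hotspots against 35 non-hotspots, so the classes are heavily imbalanced. One common mitigation (not applied above) is `LogisticRegression(class_weight='balanced')`, which reweights each class by `n_samples / (n_classes * class_count)`. The sketch below computes those weights for labels with the same imbalance as our test set:

```python
import numpy as np

# Labels with the same imbalance as our test set: few hotspots (1s)
y = np.array([0] * 35 + [1] * 2)

classes, counts = np.unique(y, return_counts=True)
# sklearn's 'balanced' heuristic: n_samples / (n_classes * class_count)
weights = len(y) / (len(classes) * counts)
print(dict(zip(classes.tolist(), weights.round(3).tolist())))
```

The minority class receives a much larger weight, pushing the model to pay proportionally more attention to the rare hotspot examples.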

Comparing our Predicted Hotspots with the Actual Hotspots¶

Now that we have predicted the hotspots, we want to take a look at those predictions to understand how our model performed.

We will now compare our predicted values with the actual labelled values from our test dataset.

Encoded Values: 0 - Not a Hotspot ; 1 - Is a Hotspot

In [168]:
predictions_df = pd.DataFrame({
    'location_id': grouped_agg_data.loc[y_test.index, 'location_id'],
    'actual': y_test,
    'predicted': y_pred
})

# Show the first few test predictions
predictions_df
Out[168]:
location_id actual predicted
40 51 0 0
22 29 0 0
55 71 0 0
88 164 0 0
0 1 0 0
26 36 0 0
39 50 0 0
66 107 0 0
10 12 0 0
44 56 0 0
85 143 0 0
35 46 0 0
70 118 0 0
62 84 0 0
12 17 0 0
4 5 0 0
18 24 0 0
28 39 0 0
49 63 0 0
64 86 0 1
15 20 0 0
67 108 0 0
77 135 0 0
30 41 0 0
33 44 1 1
11 14 0 0
65 87 0 0
68 109 0 0
31 42 0 0
76 134 0 0
9 11 1 1
69 117 0 0
5 6 0 0
42 53 0 0
47 61 0 0
16 21 0 0
45 58 0 0
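Rather than scanning the whole table by eye, we can filter it down to the rows where the model disagreed with the ground truth. The sketch below uses an illustrative subset of the table above (location 86 was the one false positive; 44 and 11 were correctly flagged hotspots):

```python
import pandas as pd

# Illustrative subset of the predictions table above.
predictions_df = pd.DataFrame({
    'location_id': [51, 86, 44, 11],
    'actual':      [0, 0, 1, 1],
    'predicted':   [0, 1, 1, 1],
})

# Keep only the rows where prediction and ground truth disagree --
# these are the locations worth inspecting further on a map.
misclassified = predictions_df[predictions_df['actual'] != predictions_df['predicted']]
print(misclassified)
```

On the full `predictions_df`, the same filter would surface every misclassified location directly.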

Checking Model Performance - Confusion Matrix¶

We have already compared the predicted and actual results row by row. However, to get a summary of all the predictions and to analyze the performance of the model, we use a confusion matrix, which counts the correct and incorrect predictions for each class.

In [165]:
import matplotlib.pyplot as plt
import seaborn as sns

cm = confusion_matrix(y_test, y_pred)

# Plot the confusion matrix using Seaborn heatmap
plt.figure(figsize=(8, 6))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', xticklabels=['Non-Hotspot', 'Hotspot'], yticklabels=['Non-Hotspot', 'Hotspot'])
plt.title('Confusion Matrix for Hotspot Prediction')
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.show()
Observation and Analysis¶
  1. In the confusion matrix, the rows represent the actual labels, while the columns represent the predicted labels.
  2. The top left cell indicates that the model has correctly identified 34 locations as Non-Hotspots (true negatives).
  3. From the top right cell, we can observe that the model has incorrectly classified one Non-Hotspot as a Hotspot (a false positive).
  4. In the bottom right cell, we can see that the model has correctly identified both hotspot instances (true positives).
  5. The bottom left cell indicates that the model has no false negatives: no actual hotspot was missed.
  6. The performance of our model is good. The confusion matrix gives us a clear understanding of the model's performance by separating the correct and incorrect predictions for each class.
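The per-class scores in the earlier classification report can be recovered directly from these four cells. The sketch below reconstructs the same counts from illustrative label arrays (34 true negatives, 1 false positive, 0 false negatives, 2 true positives) and derives hotspot precision and recall by hand:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Illustrative labels reproducing the counts discussed above.
y_true = np.array([0] * 35 + [1] * 2)
y_pred = np.array([0] * 34 + [1] + [1] * 2)

# ravel() flattens the 2x2 matrix into (tn, fp, fn, tp).
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

precision = tp / (tp + fp)  # 2 / 3, matching the 0.67 in the report
recall = tp / (tp + fn)     # 2 / 2, matching the 1.00 in the report
print(tn, fp, fn, tp, round(precision, 2), recall)
```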

Conclusion¶

  • Through this Use Case, we have efficiently integrated various datasets such as pedestrian traffic, service requests and developmental activity footprint.
  • After integrating these datasets, we identified, mapped and predicted hotspots for Graffiti Removal and Illegal Waste Dump Removal in the city of Melbourne.
  • We utilized techniques such as geospatial analysis, clustering and predictive modelling to generate insights for the public authorities.
  • Through these insights, we correlated high pedestrian traffic and developmental activity in a region with Graffiti and Waste Dumping incidents.
  • We can conclude that this data driven approach has not only helped us gather insights but also equipped us with approaches to address urban challenges effectively.

Recommendations¶

  1. Hotspot Monitoring:
  • We could deploy additional sensors in the identified areas with high pedestrian traffic or developmental activity to better monitor such incidents.
  • Consequently, when new patterns emerge, geospatial clustering could help us dynamically predict potential hotspots in Melbourne.
  2. Resource Allocation:
  • Based on the predictions given by our model, we could proactively allocate cleaning and maintenance resources.
  • This will help reduce the response time for completing service requests in the future.
  • Preventive measures, such as increased patrolling in an area, can be taken to deter such incidents.
  3. Spreading Awareness:
  • By organizing campaigns and engaging the community, awareness can be spread about the impact of Graffiti Vandalism and Illegal Waste Dumping on factors such as tourism and local businesses.
  • Residents can be encouraged to report service incidents through a mobile app (which can be linked to the datasets we have used).
  4. Deploying Technology:
  • PowerBI dashboards and machine learning models could be deployed to support decision making and make real time predictions.
  • To improve coverage, IoT based sensors could be used to monitor inaccessible or densely populated areas.
  5. Improved Policy Making:
  • After identifying the potential hotspots, we could work with the concerned policy makers to implement stringent regulations and fines for such offenses.
  • We could partner with waste management authorities to make provisions for, and reward, appropriate waste disposal via community initiatives.

References¶

  1. City of Melbourne Open Data Library: https://data.melbourne.vic.gov.au/pages/home/
  2. Graffiti in Victoria: https://www.crimeprevention.vic.gov.au/resources/graffiti-in-victoria
  3. Property Damage and Graffiti: https://www.victimsofcrime.vic.gov.au/property-damage-and-graffiti
  4. Chopde, N.R. and Nichat, M., 2013. Landmark based shortest path detection by using A* and Haversine formula. International Journal of Innovative Research in Computer and Communication Engineering, 1(2), pp.298-302.
  5. Azdy, R.A. and Darnis, F., 2020, April. Use of haversine formula in finding distance between temporary shelter and waste end processing sites. In Journal of Physics: Conference Series (Vol. 1500, No. 1, p. 012104). IOP Publishing.
  6. GeoPandas Documentation: https://geopandas.org/en/stable/docs.html
  7. GeoPy Documentation: https://geopy.readthedocs.io/en/stable/